2023
-
ShRing: networking with shared receive rings
Boris Pismenny, Adam Morrison, Dan Tsafrir
OSDI '23: USENIX Symposium on Operating Systems Design and Implementation
July 2023, Boston, MA, pages 949–968
Multicore systems parallelize to accommodate incoming Ethernet
traffic, allocating one receive (Rx) ring with ≥1Ki entries per
core by default. This ring size is sufficient to absorb packet
bursts of single-core workloads. But the combined size of all Rx
buffers (pointed to by all Rx rings) can exceed the size of the
last-level cache. We observe that, in this case, NIC and CPU memory
accesses are increasingly served by main memory, which might incur
nonnegligible overheads when scaling to hundreds of incoming
gigabits per second.
To alleviate this problem, we propose "shRing," which shares each Rx
ring among several cores when networking memory bandwidth
consumption is high. ShRing thus adds software synchronization
costs, but this overhead is offset by the smaller memory footprint.
We show that, consequently, shRing increases the throughput of NFV
workloads by up to 1.27x, and that it reduces their latency by up to
38x. The substantial latency reduction occurs when shRing shortens
the per-packet processing time to a value smaller than the packet
interarrival time, thereby preventing overload conditions.
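A minimal C sketch of the shared-ring idea follows (the names and the spinlock choice are ours, not the paper's; the actual shRing design involves considerably more than this):

  /* One Rx ring polled by several cores: the combined Rx-buffer
   * footprint shrinks enough to fit in the last-level cache, at the
   * cost of the synchronization below. */
  #include <pthread.h>

  #define RING_SIZE 1024
  struct pkt;                          /* opaque packet descriptor  */

  struct rx_ring {
      pthread_spinlock_t lock;         /* the added software cost   */
      unsigned head, tail;             /* consumer/producer indices */
      struct pkt *slots[RING_SIZE];
  };

  /* Called by any of the sharing cores; returns NULL if empty. */
  struct pkt *shring_poll(struct rx_ring *r)
  {
      struct pkt *p = NULL;
      pthread_spin_lock(&r->lock);
      if (r->head != r->tail)
          p = r->slots[r->head++ % RING_SIZE];
      pthread_spin_unlock(&r->lock);
      return p;
  }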
@InProceedings{pismenny23-shring,
  author    = {Boris Pismenny and Adam Morrison and Dan Tsafrir},
  title     = {{ShRing}: networking with shared receive rings},
  booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  year      = 2023,
  month     = {July},
  pages     = {949--968},
  address   = {Boston, MA}
}
-
Degrading data to save the planet
Aviad Zuck, Donald E. Porter, Dan Tsafrir
HotOS '23: ACM Workshop on Hot Topics in Operating Systems
June 2023, Providence, RI, pages 61–69
Storage capacity demand is projected to grow exponentially in the
coming decade and so will its contribution to the overall carbon
footprint of computing devices. In recent years, cloud providers and
device vendors have substantially reduced their carbon impact
through improved power consumption and product distribution.
However, by 2030, the manufacturing of flash-based storage devices
will account for 1.7% of carbon emissions in the world. Therefore,
reducing production-related carbon emissions of storage is key to
sustainability in computing devices.
We present Sustainability-Oriented Storage (SOS), a new host-device
co-design for personal storage devices, which opportunistically
improves storage sustainability by: (1) targeting widely-produced
flash-based personal storage devices; (2) reducing hardware
production through optimizing bit density in existing materials, up
to 50%; and (3) leveraging the limited effective lifespan of personal
devices and the longer lifespan of their underlying flash.
SOS automatically stores low-priority files, which occupy most
personal storage capacity, on high-density flash memories currently
designated for nearline storage. To avoid data loss, low-priority
files are allowed to slightly degrade in quality over time.
Switching to high-density memories, which maximize production
material utilization, reduces the overall carbon footprint of
personal storage devices.
@InProceedings{zuck23-ssdcarbon,
  author    = {Aviad Zuck and Donald E. Porter and Dan Tsafrir},
  title     = {Degrading data to save the planet},
  booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
  year      = 2023,
  month     = {June},
  pages     = {61--69},
  address   = {Providence, RI}
}
2022
-
Optimizing storage performance with calibrated interrupts
Amy Tai, Igor Smolyar, Michael Wei, Dan Tsafrir
TOS: ACM Transactions on Storage
February 2022, pages 3:1–3:32, volume 18, number 1
After request completion, an I/O device must decide whether to
minimize latency by immediately firing an interrupt or to optimize
for throughput by delaying the interrupt, anticipating that more
requests will complete soon and help amortize the interrupt
cost. Devices employ adaptive interrupt coalescing heuristics that
try to balance between these opposing goals. Unfortunately, because
devices lack the semantic information about which I/O requests are
latency-sensitive, these heuristics can sometimes lead to disastrous
results.
Instead, we propose addressing the root cause of the heuristics
problem by allowing software to explicitly specify to the device if
submitted requests are latency-sensitive. The device then
“calibrates” its interrupts to completions of latency-sensitive
requests. We focus on NVMe storage devices and show that it is
natural to express these semantics in the kernel and the application
and only requires a modest two-bit change to the device
interface. Calibrated interrupts increase throughput by up to 35%,
reduce CPU consumption by as much as 30%, and achieve up to 37%
lower latency when interrupts are coalesced.
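To make the two-bit interface concrete, here is a C sketch of the device-side decision (the hint names follow the paper's urgent/barrier annotations, but the encoding and the coalescing policy here are our simplification):

  #include <stdio.h>

  #define COALESCE_BATCH 32            /* illustrative batch size      */

  enum hint { HINT_NONE = 0, HINT_BARRIER = 1, HINT_URGENT = 2 };

  static unsigned pending;             /* completions not yet signaled */

  static void fire_interrupt(void) { puts("IRQ"); pending = 0; }

  /* Device-side logic upon each request completion: interrupts are
   * calibrated to completions that software marked latency-sensitive;
   * unmarked completions keep being coalesced. */
  void on_completion(enum hint h)
  {
      pending++;
      if (h == HINT_URGENT || h == HINT_BARRIER)
          fire_interrupt();            /* signal this completion now   */
      else if (pending >= COALESCE_BATCH)
          fire_interrupt();            /* amortize over a batch        */
  }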
@Article{tai22-cintj,
  author  = {Amy Tai and Igor Smolyar and Michael Wei and Dan Tsafrir},
  title   = {Optimizing storage performance with calibrated interrupts},
  journal = {ACM Transactions on Storage (TOS)},
  volume  = 18,
  number  = 1,
  year    = 2022,
  month   = {February},
  pages   = {3:1--3:32}
}
-
The benefits of general-purpose on-NIC memory
Boris Pismenny, Liran Liss, Adam Morrison, Dan Tsafrir
ASPLOS '22: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
February 2022, Lausanne, Switzerland, pages 1130–1147
We propose to use the small, newly available on-NIC memory
(“nicmem”) to keep pace with the rapidly increasing performance of
NICs. We motivate our proposal by accelerating two types of workload
classes: NFV and key-value stores. As NFV workloads frequently
operate on headers—rather than data—of incoming packets, we
introduce a new packet-processing architecture that splits between
the two, keeping the data on nicmem when possible and thus reducing
PCIe traffic, memory bandwidth, and CPU processing time. Our
approach consequently shortens NFV latency by up to 23% and
increases its throughput by up to 19%. Similarly, because key-value
stores commonly exhibit skewed distributions, we introduce a new
network stack mechanism that lets applications keep frequently
accessed items on nicmem. Our design shortens memcached latency by
up to 43% and increases its throughput by up to 80%.
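A C sketch of the header/data split for NFV (the descriptor layout and nic_send_gather() are hypothetical stand-ins for whatever interface the NIC driver exposes):

  #include <stdint.h>

  /* Rx descriptor: the NIC DMAs only the header to host DRAM and
   * leaves the payload in on-NIC memory ("nicmem"). */
  struct rx_desc {
      uint8_t  *hdr;          /* header bytes, in host memory */
      uint32_t  hdr_len;
      uint64_t  nicmem_off;   /* payload stays on the NIC     */
      uint32_t  data_len;
  };

  void rewrite_header(uint8_t *hdr, uint32_t len);      /* NFV logic */
  void nic_send_gather(const uint8_t *hdr, uint32_t hdr_len,
                       uint64_t nicmem_off, uint32_t data_len);

  /* Forwarding path: only headers cross PCIe; the payload is sent
   * directly from nicmem, saving PCIe and memory bandwidth. */
  void nfv_forward(struct rx_desc *rx)
  {
      rewrite_header(rx->hdr, rx->hdr_len);
      nic_send_gather(rx->hdr, rx->hdr_len, rx->nicmem_off, rx->data_len);
  }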
@InProceedings{pismenny22-nicmem,
  author    = {Boris Pismenny and Liran Liss and Adam Morrison and Dan Tsafrir},
  title     = {The benefits of general-purpose {on-NIC} memory},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2022,
  month     = {February},
  pages     = {1130--1147},
  address   = {Lausanne, Switzerland}
}
2021
-
Rowhammering storage devices
Tao Zhang, Boris Pismenny, Donald E. Porter, Dan Tsafrir,
Aviad Zuck
HotStorage '21: ACM Workshop on Hot Topics In Storage and File Systems
July 2021, Virtual event
Peripheral devices like SSDs are growing more complex, to the point
they are effectively small computers themselves. Our position is
that this trend creates a new kind of attack vector, where untrusted
software could use peripherals strictly as intended to accomplish
unintended goals. To exemplify, we set out to rowhammer the DRAM
component of a simplified SSD firmware, issuing regular I/O requests
that manage to flip bits in a way that triggers sensitive
information leakage. We conclude that such attacks might soon be
feasible, and we argue that systems need principled approaches for
securing peripherals against them.
@InProceedings{zhang21-rowhssd,
  author    = {Tao Zhang and Boris Pismenny and Donald E. Porter and Dan Tsafrir and
               Aviad Zuck},
  title     = {Rowhammering storage devices},
  booktitle = {ACM Workshop on Hot Topics In Storage and File Systems (HotStorage)},
  year      = 2021,
  month     = {July},
  address   = {Virtual event}
}
-
Optimizing storage I/O with calibrated interrupts
Amy Tai, Igor Smolyar, Michael Wei, Dan Tsafrir
OSDI '21: USENIX Symposium on Operating Systems Design and Implementation
July 2021, pages 129–145
After request completion, an I/O device must decide either to
minimize latency by immediately firing an interrupt or to optimize
for throughput by delaying the interrupt, anticipating that more
requests will complete soon and help amortize the interrupt cost.
Devices employ adaptive interrupt coalescing heuristics that try to
balance between these opposing goals. Unfortunately, because devices
lack the semantic information about which I/O requests are
latency-sensitive, these heuristics can sometimes lead to disastrous
results.
Instead, we propose addressing the root cause of the heuristics
problem by allowing software to explicitly specify to the device if
submitted requests are latency-sensitive. The device then
“calibrates” its interrupts to completions of latency-sensitive
requests. We focus on NVMe storage devices and show that it is
natural to express these semantics in the kernel and the application
and only requires a modest two-bit change to the device interface.
Calibrated interrupts increase throughput by up to 35%, reduce
CPU consumption by as much as 30%, and achieve up to 37%
lower latency when interrupts are coalesced.
@InProceedings{tai21-cinter,
  author    = {Amy Tai and Igor Smolyar and Michael Wei and Dan Tsafrir},
  title     = {Optimizing storage {I/O} with calibrated interrupts},
  booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  year      = 2021,
  month     = {July},
  pages     = {129--145}
}
-
Dealing with (some of) the fallout from meltdown
Nadav Amit, Michael Wei, Dan Tsafrir
SYSTOR '21: ACM International Conference on Systems and Storage
June 2021, Virtual event, pages 1–6
The meltdown vulnerability allows users to read kernel memory by
exploiting a hardware flaw in speculative execution. Processor
vendors recommend “page table isolation” (PTI) as a software
fix, but PTI can significantly degrade the performance of
system-call-heavy programs. Leveraging the fact that 32-bit pointers
cannot access 64-bit kernel memory, we propose “Shrink,” a safe
alternative to PTI, which is applicable to programs capable of
running in 32-bit address spaces. We show that Shrink can restore
the performance of some workloads, suggest additional potential
alternatives, and argue that vendors must be more open about
hardware flaws to allow developers to design protection schemes that
are safe and performant.
@InProceedings{amit21-melt,
  author    = {Nadav Amit and Michael Wei and Dan Tsafrir},
  title     = {Dealing with (some of) the fallout from meltdown},
  booktitle = {ACM International Conference on Systems and Storage (SYSTOR)},
  year      = 2021,
  month     = {June},
  pages     = {1--6},
  address   = {Virtual event}
}
-
Characterizing, exploiting, and detecting DMA code
injection vulnerabilities in the presence of an IOMMU
Alex Markuze, Shay Vargaftik, Gil Kupfer, Boris Pismenny,
Nadav Amit, Adam Morrison, Dan Tsafrir
EuroSys '21: ACM European Conference on Computer Systems
April 2021, pages 395–409
Direct memory access (DMA) renders a system vulnerable to DMA
attacks, in which I/O devices access memory regions not intended
for their use. Hardware input-output memory management units
(IOMMU) can be used to provide protection. However, an IOMMU
cannot prevent all DMA attacks because it only restricts DMA at
page-level granularity, leading to sub-page vulnerabilities.
Current DMA attacks rely on simple situations in which write
access to a kernel pointer is obtained due to sub-page
vulnerabilities and all other attack ingredients are available and
reside on the same page. We show that DMA vulnerabilities are a
deep-rooted issue and it is often the kernel design that enables
complex and multistage DMA attacks. This work presents a
structured top-down approach to characterize, exploit, and detect
them.
To this end, we first categorize sub-page vulnerabilities into
four types, providing insight into the structure of DMA
vulnerabilities. We then identify a set of three vulnerability
attributes that are sufficient to execute code injection attacks.
We built analysis tools that detect these sub-page vulnerabilities
and analyze the Linux kernel. We found that 72% of the device
drivers expose callback pointers, which may be overwritten by a
device to hijack the kernel control flow.
Aided by our tools' output, we demonstrate novel code injection
attacks on the Linux kernel; we refer to these as compound
attacks. All previously reported attacks are single-step, with the
vulnerability attributes present in a single page. In compound
attacks, the vulnerability attributes are initially
incomplete. However, we demonstrate that they can be obtained by
carefully exploiting standard OS behavior.
@InProceedings{markuze21-ioatk,
  author    = {Alex Markuze and Shay Vargaftik and Gil Kupfer and Boris Pismenny and
               Nadav Amit and Adam Morrison and Dan Tsafrir},
  title     = {Characterizing, exploiting, and detecting {DMA} code
               injection vulnerabilities in the presence of an {IOMMU}},
  booktitle = {ACM European Conference on Computer Systems (EuroSys)},
  year      = 2021,
  month     = {April},
  pages     = {395--409}
}
-
Autonomous NIC offloads
Boris Pismenny, Haggai Eran, Aviad Yehezkel, Liran Liss,
Adam Morrison, Dan Tsafrir
ASPLOS '21: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
April 2021, Detroit, MI, pages 18–35, awarded Best Paper
CPUs routinely offload to NICs network-related processing tasks
like packet segmentation and checksum. NIC offloads are
advantageous because they free valuable CPU cycles. But their
applicability is typically limited to layer≤4 protocols (TCP
and lower), and they are inapplicable to layer-5 protocols
(L5Ps) that are built on top of TCP. This limitation is caused
by a misfeature we call “offload dependence,” which dictates
that L5P offloading additionally requires offloading the
underlying layer≤4 protocols and related functionality: TCP,
IP, firewall, etc. The dependence of L5P offloading hinders
innovation, because it implies hard-wiring the complicated,
ever-changing implementation of the lower-level protocols.
We propose “autonomous NIC offloads,” which eliminate offload
dependence. Autonomous offloads provide a lightweight
software-device architecture that accelerates L5Ps without
having to migrate the entire layer≤4 TCP/IP stack into the
NIC. A main challenge that autonomous offloads address is coping
with out-of-sequence packets. We implement autonomous offloads
for two L5Ps: (i) NVMe-over-TCP zero-copy and CRC computation,
and (ii) https authentication, encryption, and decryption. Our
autonomous offloads increase throughput by up to 3.3x, and they
deliver CPU consumption and latency that are as low as 0.4x and
0.7x, respectively. Their implementation is already upstreamed
in the Linux kernel, and they will be supported in the
next generation of Mellanox NICs.
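A C sketch of the receive-side fast path (our simplification; the real offload also handles transmit and the resynchronization protocol):

  #include <stdbool.h>
  #include <stdint.h>

  /* The NIC keeps only per-connection L5P state, not a TCP/IP stack. */
  struct l5p_state {
      uint32_t expect_seq;    /* next in-order TCP sequence number */
  };

  void process_record_bytes(struct l5p_state *st,
                            const uint8_t *p, uint32_t len);
                            /* e.g., CRC or TLS decrypt, inline    */

  /* Returns true if the packet was processed inline by the offload. */
  bool offload_rx(struct l5p_state *st, uint32_t seq,
                  const uint8_t *payload, uint32_t len)
  {
      if (seq != st->expect_seq)
          return false;       /* out-of-sequence: software handles  */
                              /* it and later resyncs the NIC state */
      process_record_bytes(st, payload, len);
      st->expect_seq += len;
      return true;
  }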
@InProceedings{pismenny21-l5o,
  author    = {Boris Pismenny and Haggai Eran and Aviad Yehezkel and Liran Liss and
               Adam Morrison and Dan Tsafrir},
  title     = {Autonomous {NIC} offloads},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2021,
  month     = {April},
  pages     = {18--35},
  address   = {Detroit, MI}
}
2020
-
Predicting execution times with partial simulations in
virtual memory research: why and how
Mohammad Agbarya, Idan Yaniv, Jayneel Gandhi,
Dan Tsafrir
MICRO '20: IEEE/ACM International Symposium on Microarchitecture
October 2020, Global Online Event, pages 456–470
Computer architects frequently use cycle-accurate simulations,
which incur heavy overheads. Recently, virtual memory studies
increasingly employ a lighter-weight methodology that utilizes
partial simulations—of only the memory subsystem—whose output
is fed into a mathematical linear model that predicts execution
runtimes. The latter methodology is much faster, but its accuracy
is only assumed, never rigorously validated.
We question the assumption and put it to the test by developing
Mosalloc, the Mosaic Memory Allocator. Mosalloc backs the virtual
memory of applications with arbitrary combinations of 4KB, 2MB,
and 1GB pages (each combination forms a “mosaic” of
pages). Previous studies used a single page size per execution
(either 4KB or 2MB) to generate exactly two execution samples,
which defined the aforementioned linear model. In contrast,
Mosalloc can generate many samples, allowing us to test instead of
assume the model's accuracy. We find that prediction errors of
existing models can be as high as 25%–192%. We propose a new
model that bounds the maximal error below 3%, making it more
reliable and useful for exploring new ideas.
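For concreteness, the linear model being validated can be written as follows (the notation is ours, not the paper's):

  % Predict full runtime from the partial simulation's output:
  %   C_walk = address-translation (page-walk) cycles reported by the
  %            memory-subsystem-only simulation.
  % Prior studies fit (alpha, beta) from exactly two runs, all-4KB and
  % all-2MB; Mosalloc's page mosaics supply many more sample points,
  % so the fit can be tested rather than assumed.
  T_{\mathrm{runtime}} \approx \alpha \cdot C_{\mathrm{walk}} + \beta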
@InProceedings{agbarya20-mosalloc,
  author    = {Mohammad Agbarya and Idan Yaniv and Jayneel Gandhi and
               Dan Tsafrir},
  title     = {Predicting execution times with partial simulations in
               virtual memory research: why and how},
  booktitle = {IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  year      = 2020,
  month     = {October},
  pages     = {456--470},
  address   = {Global Online Event}
}
-
RAIDP: replication with intra-disk parity for
cost-effective storage of warm data
Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor,
Dan Tsafrir
EuroSys '20: ACM European Conference on Computer Systems
April 2020, Heraklion, Crete, Greece, pages 26.1–26.17
Distributed storage systems often triplicate data to reduce the
risk of permanent data loss, thereby tolerating at least two
simultaneous disk failures at the price of 2/3 of the capacity. To
reduce this price, some systems utilize erasure coding. But this
optimization is usually only applied to cold data, because erasure
coding might hinder performance for warm data.
We propose RAIDP—a new point in the distributed storage design
space between replication and erasure coding. RAIDP maintains only
two replicas, rather than three or more. It increases durability
by utilizing small disk “add-ons” for storing intra-disk erasure
codes that are local to the server but fail independently from the
disk. By carefully laying out the data, the add-ons allow RAIDP to
recover from simultaneous disk failures (add-ons can be stacked to
withstand an arbitrary number of failures). RAIDP retains much of
the benefits of replication, trading off some performance and
availability for substantially reduced storage requirements,
networking overheads, and their related costs. We implement RAIDP
in HDFS, which triplicates by default. We show that baseline RAIDP
achieves performance close to that of HDFS with only two replicas,
and performs within 21% of it with an update-oriented variant, while
halving the storage and networking
overheads and providing similar durability.
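A C sketch of the intra-disk parity maintenance (our simplification; RAIDP's actual data layout and add-on stacking are more elaborate):

  #include <stdint.h>

  #define CHUNK 4096

  struct disk {
      uint8_t addon_parity[CHUNK];  /* small add-on: XOR of all the
                                       disk's chunks; fails
                                       independently of the disk    */
  };

  void disk_read(struct disk *d, int i, uint8_t *buf);    /* backend */
  void disk_write(struct disk *d, int i, const uint8_t *buf);

  /* Updating chunk i folds old and new contents into the parity, so
   * the add-on can stand in for this disk if it fails together with
   * the disk holding a chunk's second replica. */
  void write_chunk(struct disk *d, int i, const uint8_t *new_data)
  {
      uint8_t old[CHUNK];
      disk_read(d, i, old);
      for (int b = 0; b < CHUNK; b++)
          d->addon_parity[b] ^= old[b] ^ new_data[b];
      disk_write(d, i, new_data);
  }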
@InProceedings{rosenfeld20-raidp,
  author    = {Eitan Rosenfeld and Aviad Zuck and Nadav Amit and Michael Factor and
               Dan Tsafrir},
  title     = {{RAIDP}: replication with intra-disk parity for
               cost-effective storage of warm data},
  booktitle = {ACM European Conference on Computer Systems (EuroSys)},
  year      = 2020,
  month     = {April},
  pages     = {26.1--26.17},
  address   = {Heraklion, Crete, Greece}
}
-
IOctopus: outsmarting nonuniform DMA
Igor Smolyar, Alex Markuze, Boris Pismenny, Haggai Eran,
Gerd Zellweger, Austin Bolen, Liran Liss, Adam Morrison, Dan
Tsafrir
ASPLOS '20: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
March 2020, Lausanne, Switzerland, pages 101–115, awarded Best Paper
In a multi-CPU server, memory modules are local to the CPU to
which they are connected, forming a nonuniform memory access
(NUMA) architecture. Because non-local accesses are slower than
local accesses, the NUMA architecture might degrade application
performance. Similar slowdowns occur when an I/O device issues
nonuniform DMA (NUDMA) operations, as the device is connected to
memory via a single CPU. NUDMA effects therefore degrade
application performance similarly to NUMA effects.
We observe that the similarity is not inherent but rather a
product of disregarding the intrinsic differences between I/O and
CPU memory accesses. Whereas NUMA effects are inevitable, we show
that NUDMA effects can and should be eliminated. We present
IOctopus, a device architecture that makes NUDMA impossible by
unifying multiple physical PCIe functions—one per CPU—in a
manner that makes them appear as one, both to the system software
and externally to the server. IOctopus requires only a modest
change to the device driver and firmware. We implement it on
existing hardware and demonstrate that it improves throughput and
latency by as much as 2.7× and 1.28×, respectively,
while ridding developers of the need to combat (what appeared to
be) an unavoidable type of overhead.
@InProceedings{smolyar20-oct,
  author    = {Igor Smolyar and Alex Markuze and Boris Pismenny and Haggai Eran and
               Gerd Zellweger and Austin Bolen and Liran Liss and Adam Morrison and
               Dan Tsafrir},
  title     = {{IOctopus}: outsmarting nonuniform {DMA}},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2020,
  month     = {March},
  pages     = {101--115},
  address   = {Lausanne, Switzerland}
}
2019
-
Refreshing ATC
Dan Tsafrir, Dahlia Malkhi
ATC '19: USENIX Annual Technical Conference
July 2019, Renton, WA, pages xi–xxviii
@InProceedings{tsafrir19-refreshATC,
  author    = {Dan Tsafrir and Dahlia Malkhi},
  title     = {Refreshing {ATC}},
  booktitle = {USENIX Annual Technical Conference (ATC)},
  year      = 2019,
  month     = {July},
  pages     = {xi--xxviii},
  address   = {Renton, WA}
}
-
Apps can quickly destroy your mobile's flash: why they
don't, and how to keep it that way
Tao Zhang, Aviad Zuck, Donald E. Porter, Dan Tsafrir
MobiSys '19: ACM International Conference on Mobile Systems, Applications,
and Services
June 2019, Seoul, South Korea, pages 207–221
Although flash cells wear out, a typical SSD has enough cells and
sufficiently sophisticated firmware that its lifetime generally
exceeds the expected lifetime of its host system. Even under heavy
use, SSDs last for years and can be replaced upon failure. On a
smartphone, in contrast, the hardware is more limited and we show
that, under heavy use, one can easily, and more quickly, wear out
smartphone flash storage. Consequently, a simple, unprivileged,
malicious application can render a smartphone unbootable
(“bricked”) in a few weeks with no warning signs to the user. This
bleak result becomes more worrisome when considering the fact that
smartphone users generally believe it is safe to try out new
applications. To combat this problem, we study the I/O behavior of
a wide range of Android applications. We find that high-volume
write bursts exist, yet none of the applications we checked
sustains an average write rate that is high enough to damage the
device (under reasonable usage assumptions backed by the
literature). We therefore propose a rate-limiting algorithm for
write activity that (1) prevents such attacks, (2) accommodates
“normal” bursts, and (3) ensures that the smartphone drive
lifetime is longer than a preconfigured lower bound (i.e., its
warranty). In terms of user experience, our design only requires
that, in the worst case of an app that issues continuous,
unsustainable, and unusual writes, the user decides whether to
shorten the phone's life or rate limit the problematic app.
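A token-bucket-style sketch in C of such a limiter (our simplification; the paper's algorithm additionally involves the user in the worst case):

  #include <stdint.h>

  /* Sustainable rate: bytes/sec the flash may absorb and still last
   * the warranty, roughly capacity * rated_PE_cycles / warranty_secs. */
  struct limiter {
      double tokens;   /* bytes that may be written right now  */
      double burst;    /* bucket depth: admits "normal" bursts */
      double rate;     /* sustainable bytes per second         */
      double last;     /* time of the previous refill, seconds */
  };

  /* Returns nonzero if a write of len bytes may proceed now. */
  int allow_write(struct limiter *l, double now, uint64_t len)
  {
      l->tokens += (now - l->last) * l->rate;   /* refill */
      if (l->tokens > l->burst)
          l->tokens = l->burst;
      l->last = now;
      if (l->tokens < (double)len)
          return 0;                    /* throttle (or ask the user) */
      l->tokens -= (double)len;
      return 1;
  }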
@InProceedings{zhang19-fbrick,
  author    = {Tao Zhang and Aviad Zuck and Donald E. Porter and Dan Tsafrir},
  title     = {Apps can quickly destroy your mobile's flash: why they
               don't, and how to keep it that way},
  booktitle = {ACM International Conference on Mobile Systems, Applications,
               and Services (MobiSys)},
  year      = 2019,
  month     = {June},
  pages     = {207--221},
  address   = {Seoul, South Korea}
}
-
Storm: a fast transactional dataplane for remote data structures
Stanko Novakovic, Yizhou Shan, Aasheesh Kolli, Michael
Cui, Yiying Zhang, Haggai Eran, Boris Pismenny, Liran Liss,
Michael Wei, Dan Tsafrir, Marcos Aguilera
SYSTOR '19: ACM International Conference on Systems and Storage
June 2019, Haifa, Israel, pages 97–108, awarded Best Paper
RDMA technology enables a host to access the memory of a remote
host without involving the remote CPU, improving the performance
of distributed in-memory storage systems. Previous studies
argued that RDMA suffers from scalability issues, because the
NIC's limited resources are unable to simultaneously cache the
state of all the concurrent network streams. These concerns led
to various software-based proposals to reduce the size of this
state by trading off performance. We revisit these proposals
and show that they no longer apply when using newer RDMA NICs in
rack-scale environments. In particular, we find that one-sided
remote memory primitives lead to better performance as compared
to the previously proposed unreliable datagram and kernel-based
stacks. Based on this observation, we design and implement
Storm, a transactional dataplane utilizing one-sided read and
write-based RPC primitives. We show that Storm outperforms eRPC,
FaRM, and LITE by 3.3x, 3.6x, and 17.1x, respectively, on an
InfiniBand cluster with Mellanox ConnectX-4 NICs.
@InProceedings{novakovic19-dc,
  author    = {Stanko Novakovic and Yizhou Shan and Aasheesh Kolli and Michael
               Cui and Yiying Zhang and Haggai Eran and Boris Pismenny and Liran Liss and
               Michael Wei and Dan Tsafrir and Marcos Aguilera},
  title     = {Storm: a fast transactional dataplane for remote data structures},
  booktitle = {ACM International Conference on Systems and Storage (SYSTOR)},
  year      = 2019,
  month     = {June},
  pages     = {97--108},
  address   = {Haifa, Israel}
}
-
Why and how to increase SSD performance transparency
Aviad Zuck, Tao Zhang, Philipp Guhring, Donald E. Porter, Dan Tsafrir
HotOS '19: ACM Workshop on Hot Topics in Operating Systems
May 2019, Bertinoro, Italy, pages 192–200
Even on modern SSDs, I/O scheduling is a first-order performance
concern. However, it is unclear how best to optimize I/O patterns
for SSDs, because a complex layer of proprietary firmware hides
many principal aspects of performance, as well as SSD
lifetime. Losing this information leads to research papers drawing
incorrect conclusions about prototype systems, as well as
real-world systems realizing sub-optimal performance and
lifetime. It is our position that a useful performance model of a
foundational system component is essential, and the community
should support efforts to construct models of SSD performance. We
show examples from the literature and our own measurements that
illustrate serious limitations of current SSD modeling tools and
disk statistics. We observe an opportunity to resolve this problem
by reverse engineering SSDs, leveraging recent trends toward
component standardization within SSDs. This paper presents a
feasibility study and initial results to reverse engineer a
commercial SSD's firmware, and discusses limitations and open
problems.
@InProceedings{zuck19-frvrs,
  author    = {Aviad Zuck and Tao Zhang and Philipp Guhring and Donald E. Porter and Dan Tsafrir},
  title     = {Why and how to increase {SSD} performance transparency},
  booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
  year      = 2019,
  month     = {May},
  pages     = {192--200},
  address   = {Bertinoro, Italy}
}
2018
-
DAMN: overhead-free IOMMU protection for networking
Alex Markuze, Igor Smolyar, Adam Morrison, Dan Tsafrir
ASPLOS '18: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
March 2018, Williamsburg, VA, pages 301–315
DMA operations can access memory buffers only if they are
“mapped” in the IOMMU, so operating systems protect themselves
against malicious/errant network DMAs by mapping and unmapping
each packet immediately before/after it is DMAed. This approach
was recently found to be riskier and less performant than keeping
packets non-DMAable and instead copying their content to/from
permanently-mapped buffers. Still, the extra copy hampers
performance of multi-gigabit networking.
We observe that achieving protection at the DMA (un)map boundary
is needlessly constraining, as devices must be prevented from
changing the data only after the kernel reads it. So there is no
real need to switch ownership of buffers between kernel and device
at the DMA (un)mapping layer, as opposed to the approach taken by
all existing IOMMU protection schemes. We thus eliminate the extra
copy by (1) implementing a new allocator called DMA-Aware Malloc
for Networking (DAMN), which (de)allocates packet buffers from a
memory pool permanently mapped in the IOMMU; (2) modifying the
network stack to use this allocator; and (3) copying packet data
only when the kernel needs it, which usually morphs the
aforementioned extra copy into the kernel's standard copy
operation performed at the user-kernel boundary. DAMN thus
provides full IOMMU protection with performance comparable to that
of an unprotected system.
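A C sketch of the allocator's datapath contract (the names are ours; the real DAMN integrates with the kernel's page and network-buffer machinery):

  #include <stddef.h>
  #include <string.h>

  /* Pool of buffers whose pages were IOMMU-mapped once at init. */
  struct damn_pool { void **free_list; size_t nfree; };

  void *damn_alloc(struct damn_pool *p)
  {
      return p->nfree ? p->free_list[--p->nfree] : NULL;
  }

  void damn_free(struct damn_pool *p, void *buf)
  {
      p->free_list[p->nfree++] = buf;   /* no iommu_unmap(), hence no
                                           costly IOTLB invalidation */
  }

  /* Receive path: the copy out of the mapped buffer happens only
   * where the kernel copies to user space anyway. */
  size_t damn_recv(struct damn_pool *p, void *pkt, size_t len,
                   void *user_buf)
  {
      memcpy(user_buf, pkt, len);       /* the one (standard) copy   */
      damn_free(p, pkt);                /* buffer returns to device  */
      return len;
  }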
@InProceedings{markuze18-damn,
  author    = {Alex Markuze and Igor Smolyar and Adam Morrison and Dan Tsafrir},
  title     = {{DAMN}: overhead-free {IOMMU} protection for networking},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2018,
  month     = {March},
  pages     = {301--315},
  address   = {Williamsburg, VA}
}
-
Stash in a flash
Aviad Zuck, Yue Li, Jehoshua Bruck, Donald E. Porter, Dan Tsafrir
FAST '18: USENIX Conference on File and Storage Technologies
February 2018, Oakland, CA, pages 169–188
Encryption is a useful tool to protect data confidentiality. Yet
it is still challenging to hide the very presence of encrypted,
secret data from a powerful adversary. This paper presents a new
technique to hide data in flash by manipulating the voltage level
of pseudo-randomly selected flash cells to encode two bits (rather
than one) in the cell. In this model, we have one “public” bit
interpreted using an SLC-style encoding, and extract a private bit
using an MLC-style encoding. The locations of cells that encode
hidden data are based on a secret key known only to the hiding
user.
Intuitively, this technique requires that the voltage level in a
cell encoding data must be (1) not statistically distinguishable
from a cell only storing public data, and (2) the user must be
able to reliably read the hidden data from this cell. Our key
insight is that there is a wide enough variation in the range of
voltage levels in a typical flash device to obscure the presence
of fine-grained changes to a small fraction of the cells, and that
the variation is wide enough to support reliably re-reading hidden
data. We demonstrate that our hidden data and underlying voltage
manipulations go undetected by support vector machine based
supervised learning which performs similarly to a random guess.
The error rates of our scheme are low enough that the data is
recoverable months after being stored. Compared to prior work, our
technique provides 24x and 50x higher encoding and decoding
throughput and doubles the capacity, while being 37x more power
efficient.
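An illustrative C sketch of the double encoding (the thresholds are made up; real devices require the per-chip calibration the paper describes):

  /* Publicly, a cell is read with one SLC threshold; on cells chosen
   * by the secret key, the voltage is nudged into the lower or upper
   * half of its SLC level, and a finer MLC-style read of the same
   * cell recovers one hidden bit. */
  #define SLC_THRESH 2.0                /* volts, illustrative only */

  int public_bit(double v)              /* what everyone sees       */
  {
      return v < SLC_THRESH;
  }

  int hidden_bit(double v)              /* key-holder's finer read  */
  {
      double lo = public_bit(v) ? 0.0 : SLC_THRESH;
      return v >= lo + SLC_THRESH / 2;  /* which half of the level? */
  }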
@InProceedings{zuck18-flashdeny,
  author    = {Aviad Zuck and Yue Li and Jehoshua Bruck and Donald E. Porter and Dan Tsafrir},
  title     = {Stash in a flash},
  booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
  year      = 2018,
  month     = {February},
  pages     = {169--188},
  address   = {Oakland, CA}
}
2017
-
Preserving hidden data with an ever-changing disk
Aviad Zuck, Udi Shriki, Donald E. Porter, Dan Tsafrir
HotOS '17: ACM Workshop on Hot Topics in Operating Systems
May 2017, Whistler, Canada, pages 50–55
This paper presents a storage system that can hide the presence of
hidden data alongside a larger volume of public data. Encryption
allows a user to hide the contents of data, but not the fact that
sensitive data is present. Under duress, the owner of high-value
data can be coerced by a powerful adversary to disclose decryption
keys. Thus, private users and corporations have an interest in
hiding the very presence of some sensitive data, alongside a larger
body of less sensitive data (e.g., the operating system and other
benign files); this property is called plausible
deniability. Existing plausible deniability systems do not fulfill
all of the following requirements: (1) resistance to multiple
snapshot attacks where an attacker compares the state of the device
over time; (2) ensuring that hidden data won't be destroyed when
the public volume is modified by a user unaware of the hidden data;
and (3) disguising writes to secret data as normal system
operations on public data.
We explain why existing solutions do not meet all these
requirements and present the Ever-Changing Disk (ECD), a generic
scheme for plausible deniability storage systems that meets all of
these requirements. An ECD stores hidden data inside a large volume
of pseudorandom data. Portions of this volume are periodically
migrated in a log-structured manner. Hidden writes can then be
interchanged with normal firmware operations. The expected access
patterns and time until hidden data is overwritten are completely
predictable, and insensitive to whether data is hidden. Users
control the rate of internal data migration (R), trading write
bandwidth to hidden data for longevity of the hidden data. For a
typical 2TB disk and setting of R, a user preserves hidden data by
entering her secret key every few days or weeks.
@InProceedings{zuck17-everchng,
  author    = {Aviad Zuck and Udi Shriki and Donald E. Porter and Dan Tsafrir},
  title     = {Preserving hidden data with an ever-changing disk},
  booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
  year      = 2017,
  month     = {May},
  pages     = {50--55},
  address   = {Whistler, Canada}
}
-
Flash drive lifespan *is* a problem
Tao Zhang, Aviad Zuck, Donald E. Porter, Dan Tsafrir
HotOS '17: ACM Workshop on Hot Topics in Operating Systems
May 2017, Whistler, Canada, pages 42–49
When flash was introduced, wear-out was a known problem. Over
time, a number of techniques have been developed to estimate the
expected number of program/erase cycles under typical usage
patterns, and sufficiently over-provision the cells such that the
device meets its expected lifespan, even if individual cells fail.
This paper started as a simple experiment: measuring whether the
lifespan of flash devices in smartphones and other mobile devices
matches the estimates. To our surprise, we find that, in a matter of
days, simple, unprivileged applications can render the drive of
several smartphones (and thus, the phone) inoperable. This result
is concerning, as it means that installing malicious or
poorly-written software could destroy the device itself. We
experimentally demonstrate the problem, discuss reasons why it
occurs, and consider potential solutions.
@InProceedings{zhang17-flifespan,
  author    = {Tao Zhang and Aviad Zuck and Donald E. Porter and Dan Tsafrir},
  title     = {Flash drive lifespan *is* a problem},
  booktitle = {ACM Workshop on Hot Topics in Operating Systems (HotOS)},
  year      = 2017,
  month     = {May},
  pages     = {42--49},
  address   = {Whistler, Canada}
}
-
Page fault support for network controllers
Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro,
Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit,
Dan Tsafrir
ASPLOS '17: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
April 2017, Xi'an, China, pages 449–466
Direct network I/O allows network controllers (NICs) to expose
multiple instances of themselves, to be used by untrusted software
without a trusted intermediary. Direct I/O thus frees researchers
from legacy software, fueling studies that innovate in multitenant
setups. Such studies, however, overwhelmingly ignore one serious
problem: direct memory accesses (DMAs) of NICs disallow page
faults, forcing systems to either pin entire address spaces to
physical memory and thereby hinder memory utilization, or resort to
APIs that pin/unpin memory buffers before/after they are DMAed,
which complicates the programming model and hampers performance.
We solve this problem by designing and implementing page fault
support for InfiniBand and Ethernet NICs. A main challenge we
tackle—unique to NICs—is handling receive DMAs that trigger
page faults, leaving the NIC without memory to store the incoming
data. We demonstrate that our solution provides all the benefits
associated with “regular” virtual memory, notably (1) a simpler
programming model that rids users from the need to pin, and (2) the
ability to employ all the canonical memory optimizations, such as
memory overcommitment and demand-paging based on actual use. We
show that, as a result, benchmark performance improves by up to
1.9x.
@InProceedings{lesokhin17-npf,
  author    = {Ilya Lesokhin and Haggai Eran and Shachar Raindel and Guy Shapiro and
               Sagi Grimberg and Liran Liss and Muli Ben-Yehuda and Nadav Amit and
               Dan Tsafrir},
  title     = {Page fault support for network controllers},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2017,
  month     = {April},
  pages     = {449--466},
  address   = {Xi'an, China}
}
-
Hardware and software support for virtualization
[book]
Edouard Bugnion, Jason Nieh, Dan Tsafrir
Morgan & Claypool Publishers
Synthesis Lectures on Computer Architecture, February 2017, pages 1–206, volume 12, number 1
This textbook focuses on the core question of the necessary
architectural support provided by hardware to efficiently run
virtual machines, and of the corresponding design of the hypervisors
that run them. Virtualization is still possible when the
instruction set architecture lacks such support, but the hypervisor
remains more complex and must rely on additional techniques.
Despite the focus on architectural support in current architectures,
some historical perspective is necessary to appropriately frame the
problem. The first half of the text provides the historical
perspective of the theoretical framework developed four decades ago
by Popek and Goldberg. It also describes earlier systems that
enabled virtualization despite the lack of architectural support in
hardware.
As is often the case, theory defines a necessary—but not
sufficient—set of features, and modern architectures are the
result of the combination of the theoretical framework with insights
derived from practical systems. The second half of the text
describes state-of-the-art support for virtualization in both x86-64
and ARM processors. This textbook includes an in-depth description
of the CPU, memory and I/O virtualization of these two processor
architectures, as well as case studies on the Linux/KVM, VMware, and
Xen hypervisors. It concludes with a performance comparison of
virtualization on current-generation x86- and ARM-based systems
across multiple hypervisors.
@Book{bugnion17-virtbook,
  author    = {Edouard Bugnion and Jason Nieh and Dan Tsafrir},
  title     = {Hardware and software support for virtualization},
  publisher = {Morgan \& Claypool Publishers},
  series    = {Synthesis Lectures on Computer Architecture},
  volume    = 12,
  number    = 1,
  year      = 2017,
  month     = {February},
  pages     = {1--206}
}
2016
-
Hash, don't cache (the page table)
Idan Yaniv, Dan Tsafrir
SIGMETRICS '16: ACM SIGMETRICS International Conference on Measurement and
Modeling
June 2016, Antibes Juan-les-Pins, France, pages 337–350
Radix page tables as implemented in the x86-64 architecture incur
a penalty of four memory references for address translation upon
each TLB miss. These 4 references become 24 in virtualized setups,
accounting for 5%–90% of the runtime and thus motivating chip
vendors to incorporate page walk caches
(PWCs). Counterintuitively, an ISCA'10 paper found that radix page
tables with PWCs are superior to hashed page tables, yielding up
to 5x fewer DRAM accesses per page walk. We challenge this finding
and show that it is the result of comparing against a suboptimal
hashed implementation—that of the Itanium architecture. We show
that, when carefully optimized, hashed page tables in fact
outperform existing PWC-aided x86-64 hardware, shortening
benchmark runtimes by 1%–27% and 6%–32% in bare-metal and
virtualized setups, without resorting to PWCs. We further show
that hashed page tables are inherently more scalable than radix
designs and are better suited to accommodate the ever increasing
memory size; their downside is that they make it more challenging
to support such features as superpages.
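A C sketch of the lookup being advocated (open addressing is our choice for brevity; the paper evaluates carefully optimized variants):

  #include <stdint.h>

  #define EMPTY UINT64_MAX
  #define FAULT UINT64_MAX

  struct hpte { uint64_t vpn, pfn; };   /* one hashed entry */

  static uint64_t hash64(uint64_t x)    /* any decent mixer works */
  {
      x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
      return x;
  }

  /* On a hit, one memory reference typically resolves a TLB miss,
   * versus four dependent radix references (24 when virtualized).
   * Assumes the table is never completely full. */
  uint64_t translate(const struct hpte *t, uint64_t nbuckets, uint64_t vpn)
  {
      for (uint64_t i = hash64(vpn) % nbuckets; ; i = (i + 1) % nbuckets) {
          if (t[i].vpn == vpn)   return t[i].pfn;
          if (t[i].vpn == EMPTY) return FAULT;   /* unmapped */
      }
  }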
@InProceedings{yaniv16-hvsr,
  author    = {Idan Yaniv and Dan Tsafrir},
  title     = {Hash, don't cache (the page table)},
  booktitle = {ACM SIGMETRICS International Conference on Measurement and
               Modeling (SIGMETRICS)},
  year      = 2016,
  month     = {June},
  pages     = {337--350},
  address   = {Antibes Juan-les-Pins, France}
}
-
Synopsis of the ASPLOS '16 wild and crazy ideas (WACI) invited-speakers session
Dan Tsafrir
ASPLOS '16: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
April 2016, Atlanta, GA, pages 291–294
The Wild and Crazy Ideas (WACI) session is a longstanding tradition
at ASPLOS, soliciting talks that consist of forward-looking,
visionary, inspiring, creative, far out or just plain amazing ideas
presented in an exciting way. (Amusing elements in the
presentations are tolerated ;-) but are in fact optional.)
The first WACI session took place in 1998. Back then, the call for
talks included a problem statement, which contended that “papers
usually do not get admitted to [such conferences as] ISCA or ASPLOS
unless the systems that they describe are mature enough to run
[some standard benchmark suites, which] has a chilling effect on
the idea generation process—encouraging incremental research”
[1]. The 1998 WACI session turned out to be a great success. Its
webpage states that “there were 42 submissions [competing over]
only eight time slots, [which resulted in] this session [having] a
lower acceptance rate than the conference itself” [2].
But the times they are a-changin' [3], and the WACI session no
longer enjoys that many submissions (Figure 1), perhaps because
nowadays there exist many forums for researchers to
describe/discuss their preliminary ideas, including: the “hot
topics in” workshops [4–7]; a journal like CAL, dedicated to
early results [8]; main conferences soliciting short submissions
describing “original or unconventional ideas at a preliminary
stage” in addition to regular papers [9]; and the many workshops
co-located with main conferences, like ISCA '15, which hosted
thirteen such workshops [10].
Regardless of the reason for the declining number of submissions,
this time we've decided to organize the WACI session differently to
ensure its continued high quality. Instead of soliciting talks via
an open call and hoping for the best, we proactively invited
speakers whom we believe are capable of delivering excellent WACI
presentations. That is, this year's WACI session consists
exclusively of invited speakers. Filling up the available slots
turned out to be fairly easy, as most of the researchers we invited
promptly accepted our invitation. The duration of each talk was set
to be eight minutes (exactly as in the first WACI session from
1998) plus two minutes for questions.
The talks are outlined below. We believe they are interesting and
exciting, and we hope the attendees of the session will find them
stimulating and insightful.
@InProceedings{tsafrir16-waci,
  author    = {Dan Tsafrir},
  title     = {Synopsis of the {ASPLOS} '16 wild and crazy ideas {(WACI)} invited-speakers session},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2016,
  month     = {April},
  pages     = {291--294},
  address   = {Atlanta, GA}
}
-
vRIO: Paravirtual remote I/O
Yossi Kuperman, Eyal Moscovici, Joel Nider,
Razya Ladelsky, Abel Gordon, Dan Tsafrir
ASPLOS '16: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
April 2016, Atlanta, GA, pages 49–65
The traditional “trap and emulate” I/O paravirtualization model
conveniently allows for I/O interposition, yet it inherently
incurs costly guest-host context switches. The newer “sidecore”
model eliminates this overhead by dedicating host (side)cores to
poll the relevant guest memory regions and react accordingly
without context switching. But the dedication of sidecores on each
host might be wasteful when I/O activity is low, or it might not
provide enough computational power when I/O activity is high. We
propose to alleviate this problem at rack scale by consolidating
the dedicated sidecores spread across several hosts onto one
server. The hypervisor is then effectively split into two parts:
the local hypervisor that hosts the VMs, and the remote hypervisor
that processes their paravirtual I/O. We call this model
vRIO—paraVirtual Remote I/O. We find that by increasing the
latency somewhat, it provides comparable throughput with fewer
sidecores and superior throughput with the same number of
sidecores as compared to the state of the art. vRIO additionally
constitutes a new, cost-effective way to consolidate I/O devices
(on the remote hypervisor) while supporting efficient programmable
I/O interposition.
@InProceedings{kuperman16-vrio,
  author    = {Yossi Kuperman and Eyal Moscovici and Joel Nider and
               Razya Ladelsky and Abel Gordon and Dan Tsafrir},
  title     = {{vRIO}: {Paravirtual} remote {I/O}},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2016,
  month     = {April},
  pages     = {49--65},
  address   = {Atlanta, GA}
}
-
True IOMMU protection from DMA attacks: when copy is faster than zero copy
Alex Markuze, Adam Morrison, Dan Tsafrir
ASPLOS '16: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
April 2016, Atlanta, GA, pages 249–262
Malicious I/O devices might compromise the OS using DMAs. The OS
therefore utilizes the IOMMU to map and unmap every target buffer
right before and after its DMA is processed, thereby restricting
DMAs to their designated locations. This usage model, however, is
not truly secure for two reasons: (1) it provides protection at
page granularity only, whereas DMA buffers can reside on the same
page as other data; and (2) it delays DMA buffer unmaps due to
performance considerations, creating a vulnerability window in
which devices can access in-use memory.
We propose that OSes utilize the IOMMU differently, in a manner
that eliminates these two flaws. Our new usage model restricts
device access to a set of shadow DMA buffers that are never
unmapped, and it copies DMAed data to/from these buffers, thus
providing sub-page protection while eliminating the aforementioned
vulnerability window. Our key insight is that the cost of
interacting with, and synchronizing access to, the slow IOMMU
hardware—required for zero-copy protection against devices—makes
copying preferable to zero-copying.
We implement our model in Linux and evaluate it with standard
networking benchmarks utilizing a 40 Gb/s NIC. We demonstrate that
despite being more secure than the safest preexisting usage model,
our approach provides up to 5× higher
throughput. Additionally, whereas it is inherently less scalable
than an IOMMU-less (unprotected) system, our approach incurs only
0%–25% performance degradation in comparison.
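A C sketch of the proposed usage model (the names are ours; the real implementation sits in the kernel's DMA API):

  #include <string.h>

  void *shadow_alloc(size_t len);   /* from a pool of buffers that */
  void  shadow_free(void *buf);     /* stay IOMMU-mapped forever   */
  void  device_dma(void *buf, size_t len);

  /* Transmit path: the device only ever sees shadow buffers, so
   * sub-page neighbors are unreachable, and since nothing is ever
   * unmapped there is no deferred-unmap vulnerability window. */
  void dma_to_device(const void *src, size_t len)
  {
      void *shadow = shadow_alloc(len);
      memcpy(shadow, src, len);     /* the copy that beats zero-copy */
      device_dma(shadow, len);
      shadow_free(shadow);          /* recycle: no iommu_unmap()     */
  }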
@InProceedings{markuze16-cim,
  author    = {Alex Markuze and Adam Morrison and Dan Tsafrir},
  title     = {True {IOMMU} protection from {DMA} attacks: when copy is faster than zero copy},
  booktitle = {ACM International Conference on Architectural Support for
               Programming Languages and Operating Systems (ASPLOS)},
  year      = 2016,
  month     = {April},
  pages     = {249--262},
  address   = {Atlanta, GA}
}
-
The nom profit-maximizing operating system
Muli Ben-Yehuda, Orna Agmon Ben-Yehuda, Dan Tsafrir
VEE '16: ACM SIGPLAN/SIGOPS International Conference on
Virtual Execution Environments
April 2016, Atlanta, GA, pages 145–160
In the near future, cloud providers will sell their users virtual
machines with CPU, memory, network, and storage resources whose
prices constantly change according to market-driven supply and
demand conditions. Running traditional operating systems in these
virtual machines is a poor fit: traditional operating systems are
not aware of changing resource prices and their sole aim is to
maximize performance with no consideration of costs. Consequently,
they yield low profits.
We present nom, a profit-maximizing operating system designed for
cloud computing platforms with dynamic resource
prices. Applications running on nom aim to maximize profits by
optimizing simultaneously for performance and resource costs. The
nom kernel provides them with direct access to the underlying
hardware and full control over their private software
stacks. Since nom applications know there is no single “best”
software stack, they adapt their stacks' behavior on the fly
according to the current price of available resources and their
private utility from them, which differs between applications. We
show that in addition to achieving up to 3.9x better throughput
and up to 9.1x better latency, nom applications yield up to 11.1x
higher profits when compared with the same applications running on
Linux and OSv.
@InProceedings{ben-yehuda16-nom,
  author    = {Muli Ben-Yehuda and Orna {Agmon Ben-Yehuda} and Dan Tsafrir},
  title     = {The nom profit-maximizing operating system},
  booktitle = {ACM SIGPLAN/SIGOPS International Conference on
               Virtual Execution Environments (VEE)},
  year      = 2016,
  month     = {April},
  pages     = {145--160},
  address   = {Atlanta, GA}
}
-
Bare-metal performance for virtual machines with exitless
interrupts
Nadav Amit, Abel Gordon, Nadav Har'El, Muli Ben-Yehuda,
Alex Landau, Assaf Schuster, Dan Tsafrir
CACM: Communications of the ACM
January 2016, pages 108–116, volume 59, number 1, CACM Research Highlight (see technical perspective by Steven Hand)
Direct device assignment enhances the performance of guest virtual
machines by allowing them to communicate with I/O devices without
host involvement. But even with device assignment, guests are
still unable to approach bare-metal performance, because the host
intercepts all interrupts, including those generated by assigned
devices to signal to guests the completion of their I/O
requests. The host involvement induces multiple unwarranted
guest/host context switches, which significantly hamper the
performance of I/O intensive workloads. To solve this problem, we
present ExitLess Interrupts (ELI), a software-only approach for
handling interrupts within guest virtual machines directly and
securely. By removing the host from the interrupt handling path,
ELI improves the throughput and latency of unmodified, untrusted
guests by 1.3x–1.6x, allowing them to reach 97%–100% of
bare-metal performance even for the most demanding I/O-intensive
workloads.
@Article{amit16-elij,
  author  = {Nadav Amit and Abel Gordon and Nadav Har'El and Muli Ben-Yehuda and
             Alex Landau and Assaf Schuster and Dan Tsafrir},
  title   = {Bare-metal performance for virtual machines with exitless
             interrupts},
  journal = {Communications of the ACM (CACM)},
  volume  = 59,
  number  = 1,
  year    = 2016,
  month   = {January},
  pages   = {108--116}
}
2015
-
Virtual CPU validation
Nadav Amit, Dan Tsafrir, Assaf Schuster, Ahmad Ayoub, Eran Shlomo
SOSP '15: ACM Symposium on Operating Systems Principles
October 2015, Monterey, CA, pages 311–327
Testing the hypervisor is important for ensuring the correct
operation and security of systems, but it is a hard and
challenging task. We observe, however, that the challenge is
similar in many respects to that of testing real CPUs. We thus
propose to apply the testing environment of CPU vendors to
hypervisors. We demonstrate the advantages of our proposal by
adapting Intel's testing facility to the Linux KVM
hypervisor. We uncover and fix 117 bugs, six of which are
security vulnerabilities. We further find four flaws in Intel
virtualization technology, causing a disparity between the
observable behavior of code running on virtual and bare-metal
servers.
@InProceedings{amit15-hvfuz,
  author    = {Nadav Amit and Dan Tsafrir and Assaf Schuster and Ahmad Ayoub and Eran Shlomo},
  title     = {Virtual {CPU} validation},
  booktitle = {ACM Symposium on Operating Systems Principles (SOSP)},
  year      = 2015,
  month     = {October},
  pages     = {311--327},
  address   = {Monterey, CA}
}
-
Securing self-virtualizing Ethernet devices
Igor Smolyar, Muli Ben-Yehuda, Dan Tsafrir
USENIX Security Symposium
August 2015, Washington, D.C., pages 335–350
Single root I/O virtualization (SRIOV) is a hardware/software
interface that allows devices to “self virtualize” and thereby
remove the host from the critical I/O path. SRIOV thus brings
near bare-metal performance to untrusted guest virtual machines
(VMs) in public clouds, enterprise data centers, and
high-performance computing setups. We identify a design flaw in
current Ethernet SRIOV NIC deployments that enables untrusted
VMs to completely control the throughput and latency of other,
unrelated VMs. The attack exploits Ethernet “pause” frames,
which enable network flow control functionality. We
experimentally launch the attack across several NIC models and
find that it is effective and highly accurate, with substantial
consequences if left unmitigated: (1) to be safe, NIC vendors
will have to modify their NICs so as to filter pause frames
originating from SRIOV instances; (2) in the meantime,
administrators will have to either trust their VMs, or configure
their switches to ignore pause frames, thus relinquishing flow
control, which might severely degrade networking performance. We
present the Virtualization-Aware Network Flow Controller
(VANFC), a software-based SRIOV NIC prototype that overcomes the
attack. VANFC filters pause frames from malicious virtual
machines without any loss of performance, while keeping SRIOV
and Ethernet flow control hardware/software interfaces intact.
@InProceedings{smolyar15-sriovsec,
  author    = {Igor Smolyar and Muli Ben-Yehuda and Dan Tsafrir},
  title     = {Securing self-virtualizing {Ethernet} devices},
  booktitle = {USENIX Security Symposium},
  year      = 2015,
  month     = {August},
  pages     = {335--350},
  address   = {Washington, D.C.}
}
-
Utilizing the IOMMU scalably
Omer Peleg, Adam Morrison, Benjamin Serebrin, Dan Tsafrir
ATC '15: USENIX Annual Technical Conference
July 2015, Santa Clara, CA, pages 549–562
IOMMUs provided by modern hardware allow the OS to enforce
memory protection controls on the DMA operations of its I/O
devices. An IOMMU translation management design must scalably
handle frequent concurrent updates of IOMMU translations made
by multiple cores, which occur in high throughput I/O
workloads such as multi-Gb/s networking. Today, however, OSes
experience performance meltdowns when using the IOMMU in such
workloads.
This paper explores scalable IOMMU management designs and
addresses the two main bottlenecks we find in current OSes:
(1) assignment of I/O virtual addresses (IOVAs), and
(2) management of the IOMMU's TLB.
We propose three approaches for scalable IOVA assignment:
(1) dynamic identity mappings, which eschew IOVA allocation
altogether, (2) allocating IOVAs using the kernel's kmalloc,
and (3) per-core caching of IOVAs allocated by a
globally-locked IOVA allocator. We further describe a scalable
IOMMU TLB management scheme that is compatible with all these
approaches.
Evaluation of our designs under Linux shows that (1) they
achieve 88.5%–100% of the performance obtained without an
IOMMU, (2) they achieve similar latency to that obtained
without an IOMMU, (3) scalable IOVA allocation and dynamic
identity mappings perform comparably, and (4) kmalloc provides
a simple solution with high performance, but can suffer from
unbounded page table blowup.
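The per-core caching approach can be sketched as follows (a simplified illustration under an assumed, hypothetical globally locked allocator; not the paper's code): each core keeps a small private stash of IOVA ranges and touches the shared allocator only when its stash misses or overflows.

#include <cstdint>
#include <mutex>
#include <vector>

using iova_t = uint64_t;

struct GlobalAllocator {            // stand-in for the globally locked allocator
    std::mutex lock;
    iova_t next = 1ull << 20;
    iova_t alloc() { std::lock_guard<std::mutex> g(lock); return next += 4096; }
    void free(iova_t) { /* return range to the global pool (omitted) */ }
};

struct PerCoreCache {               // one instance per core, accessed locklessly
    static constexpr size_t kMax = 128;
    std::vector<iova_t> cached;
    iova_t alloc(GlobalAllocator& g) {
        if (!cached.empty()) { iova_t v = cached.back(); cached.pop_back(); return v; }
        return g.alloc();           // slow path: take the global lock
    }
    void free(GlobalAllocator& g, iova_t v) {
        if (cached.size() < kMax) cached.push_back(v);
        else g.free(v);             // overflow: hand back to the global pool
    }
};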
@InProceedings{peleg15-eskimo,
  author    = {Omer Peleg and Adam Morrison and Benjamin Serebrin and Dan Tsafrir},
  title     = {Utilizing the {IOMMU} scalably},
  booktitle = {USENIX Annual Technical Conference (ATC)},
  year      = 2015,
  month     = {July},
  pages     = {549--562},
  address   = {Santa Clara, CA}
}
-
rIOMMU: Efficient IOMMU for I/O devices that employ ring
buffers
Moshe Malka, Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir
ASPLOS '15: ACM International Conference on Architectural Support for
Programming Languages and Operating Systems
March,
2015,
Istanbul, Turkey,
pages 355–368
Abstract ,
BibTeX ,
PDF ,
Definitive
The IOMMU allows the OS to encapsulate I/O devices in their own
virtual memory spaces, thus restricting their DMAs to specific
memory pages. The OS uses the IOMMU to protect itself against
buggy drivers and malicious/errant devices. But the added
protection comes at a cost, degrading the throughput of
I/O-intensive workloads by up to an order of magnitude. This cost
has motivated system designers to trade off some safety for
performance, e.g., by leaving stale information in the IOTLB for a
while so as to amortize costly invalidations.
We observe that high-bandwidth devices—like network and PCIe SSD
controllers—interact with the OS via circular ring buffers that
induce a sequential, predictable workload. We design a ring
IOMMU (rIOMMU) that leverages this characteristic by replacing the
virtual memory page table hierarchy with a circular, flat table. A
flat table is adequately supported by exactly one IOTLB entry,
making every new translation an implicit invalidation of the
former and thus requiring explicit invalidations only at the end
of I/O bursts. Using standard networking benchmarks, we show that
rIOMMU provides up to 7.56x higher throughput relative to the
baseline IOMMU, and that it is within 0.77–1.00x the throughput
of a system without IOMMU protection.
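The flat-table idea can be illustrated as follows (a conceptual sketch, not the hardware design): translations simply index a circular array of mappings, so installing the next entry in a slot implicitly invalidates the single cached translation for that slot.

#include <array>
#include <cstddef>
#include <cstdint>

struct RingEntry { uint64_t iova, phys; bool valid; };

// A circular, flat translation table: one entry per in-flight DMA buffer,
// filled in the same order the device consumes its ring descriptors.
template <size_t N>
struct RingTable {
    std::array<RingEntry, N> slots{};
    size_t head = 0;
    // Map the next buffer; overwriting a slot implicitly invalidates the
    // previous translation cached for it (a single-entry IOTLB suffices).
    size_t map(uint64_t iova, uint64_t phys) {
        size_t i = head;
        slots[i] = {iova, phys, true};
        head = (head + 1) % N;
        return i;
    }
    // Explicit invalidation is only needed when an I/O burst ends.
    void invalidate_all() { for (auto& e : slots) e.valid = false; }
};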
@InProceedings{malka15-riommu,
  author    = {Moshe Malka and Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir},
  title     = {{rIOMMU}: {Efficient} {IOMMU} for {I/O} devices that employ ring buffers},
  booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
  year      = 2015,
  month     = {March},
  pages     = {355--368},
  address   = {Istanbul, Turkey}
}
-
Efficient intra-operating system protection against harmful DMAs
Moshe Malka, Nadav Amit, Dan Tsafrir
FAST '15: USENIX Conference on File and Storage Technologies
February,
2015,
Santa Clara, CA,
pages 29–44
Abstract ,
BibTeX ,
PDF ,
Definitive
Operating systems can defend themselves against misbehaving I/O
devices and drivers by employing intra-OS protection. With “strict”
intra-OS protection, the OS uses the IOMMU to map each DMA buffer
immediately before the DMA occurs and to unmap it immediately after.
Strict protection is costly due to IOMMU-related hardware overheads,
motivating “deferred” intra-OS protection, which trades off some
safety for performance.
We investigate the Linux intra-OS protection mapping layer and
discover that hardware overheads are not exclusively to blame for its
high cost. Rather, the cost is amplified by the I/O virtual address
(IOVA) allocator, which regularly induces linear complexity. We find
that the nature of IOVA allocation requests is inherently simple and
constrained due to the manner by which I/O devices are used, allowing
us to deliver constant time complexity with a compact, easy-to-implement
optimization. Our optimization improves the throughput of standard
benchmarks by up to 5.5x. It delivers strict protection with
performance comparable to that of the baseline deferred protection.
To generalize our case that OSes drive the IOMMU with suboptimal
software, we additionally investigate the FreeBSD mapping layer and
obtain similar findings.
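Because drivers allocate IOVAs in a handful of fixed sizes and free them shortly after, a constant-time fast path can be sketched by caching freed ranges per size class (an illustrative reconstruction of the idea, not the paper's exact data structure):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using iova_t = uint64_t;

struct IovaCache {
    std::unordered_map<size_t, std::vector<iova_t>> free_by_size; // size class -> freed ranges
    iova_t next_fresh = 1ull << 32;

    iova_t alloc(size_t pages) {                  // O(1) expected
        auto& bin = free_by_size[pages];
        if (!bin.empty()) { iova_t v = bin.back(); bin.pop_back(); return v; }
        iova_t v = next_fresh;                    // fallback: carve a fresh range
        next_fresh += pages * 4096ull;
        return v;
    }
    void free(size_t pages, iova_t v) { free_by_size[pages].push_back(v); }
};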
@InProceedings{malka15-eiovar,
  author    = {Moshe Malka and Nadav Amit and Dan Tsafrir},
  title     = {Efficient intra-operating system protection against harmful {DMAs}},
  booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
  year      = 2015,
  month     = {February},
  pages     = {29--44},
  address   = {Santa Clara, CA}
}
2014
-
Experience with using the Parallel Workloads Archive
Dror G. Feitelson, Dan Tsafrir, David Krakov
JPDC: Journal of Parallel and Distributed Computing
October,
2014,
pages 2967–2982,
volume 74,
number 10
Abstract ,
BibTeX ,
PDF ,
Definitive
Workload data is an invaluable resource for the design and
evaluation of computer systems. However, data quality is always an
issue. Therefore the experience gained when using data is also a
precious resource, and both the data and the experience should be
recorded and made available to other researchers. We demonstrate
this for the Parallel Workloads Archive, which records job-level
usage data from many large-scale parallel supercomputers, clusters,
and grids. Data quality problems encountered include missing data,
inconsistent data, erroneous data, system configuration changes
during the logging period, and nonrepresentative user
behavior. Some of these may be countered by filtering out the
problematic data items. In other cases, being cognizant of the
problems may affect the decision of which datasets to use.
@Article{feitelson14-pwaexp,
  author  = {Dror G. Feitelson and Dan Tsafrir and David Krakov},
  title   = {Experience with using the {Parallel} {Workloads} {Archive}},
  journal = {Journal of Parallel and Distributed Computing (JPDC)},
  volume  = 74,
  number  = 10,
  year    = 2014,
  month   = {October},
  pages   = {2967--2982}
}
-
The rise of RaaS: resource-as-a-service cloud
Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster,
Dan Tsafrir
CACM: Communications of the ACM
July,
2014,
volume 57,
number 7
Abstract ,
BibTeX ,
PDF ,
Definitive
Over the next few years, a new model of buying and selling cloud
computing resources will evolve. Instead of providers exclusively
selling server-equivalent virtual machines for relatively long
periods of time (as done in today's IaaS clouds), they will
increasingly sell to clients individual resources (such as CPU,
memory, and I/O resources), reprice them and adjust their quantity
every few seconds. Instead of allocating resources on a
first-come, first-served basis, they will do so in accordance with
market-driven supply-and-demand conditions. We term this nascent
economic model of cloud computing the Resource-as-a-Service (RaaS)
cloud, and we argue that its rise is the likely culmination of
recent trends in the construction of Infrastructure-as-a-Service
(IaaS) clouds and of the economic forces operating on both
providers and clients.
@Article{ben-yehuda14-raasj,
  author  = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
  title   = {The rise of {RaaS:} resource-as-a-service cloud},
  journal = {Communications of the ACM (CACM)},
  volume  = 57,
  number  = 7,
  year    = 2014,
  month   = {July}
}
-
VSwapper: A memory swapper for virtualized environments
Nadav Amit, Dan Tsafrir, Assaf Schuster
ASPLOS '14: ACM International Conference on Architectural Support for Programming Languages and
Operating Systems
March,
2014,
Salt Lake City, UT,
pages 349–366,
HiPEAC Paper Award
Abstract ,
BibTeX ,
PDF ,
Definitive
The number of guest virtual machines that can be consolidated on one
physical host is typically limited by the memory size, motivating
memory overcommitment. Guests are given a choice to either install a
“balloon” driver to coordinate the overcommitment activity, or to
experience degraded performance due to uncooperative swapping.
Ballooning, however, is not a complete solution, as hosts must still
fall back on uncooperative swapping in various circumstances.
Additionally, ballooning takes time to accommodate change, and so
guests might experience degraded performance under changing
conditions.
Our goal is to improve the performance of hosts when they fall back on
uncooperative swapping and/or operate under changing load conditions.
We carefully isolate and characterize the causes for the associated
poor performance, which include various types of superfluous swap
operations, decayed swap file sequentiality, and ineffective prefetch
decisions upon page faults. We address these problems by implementing
VSwapper, a guest-agnostic memory swapper for virtual
environments that allows efficient, uncooperative overcommitment. With
inactive ballooning, VSwapper yields up to an order of magnitude
performance improvement. Combined with ballooning, VSwapper can achieve
up to double the performance under changing load conditions.
@InProceedings{amit14-vswap,
  author    = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
  title     = {{VSwapper}: {A} memory swapper for virtualized environments},
  booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
  year      = 2014,
  month     = {March},
  pages     = {349--366},
  address   = {Salt Lake City, UT}
}
-
Memory swapper for virtualized environments
Nadav Amit, Dan Tsafrir, Assaf Schuster
Patent Number: US 9811268 B2 (granted: Nov 2017)
February,
2014
BibTeX ,
Html
@Misc{amit14-vswappat,
  author       = {Nadav Amit and Dan Tsafrir and Assaf Schuster},
  title        = {Memory swapper for virtualized environments},
  howpublished = {Patent Number: US 9811268 B2 (granted: Nov 2017)},
  year         = 2014,
  month        = {February}
}
2013
-
Deconstructing Amazon EC2 spot instance pricing
Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster,
Dan Tsafrir
TEAC: ACM Transactions on Economics and Computation
September,
2013,
pages 16:1–16:20,
volume 1,
number 3
Abstract ,
BibTeX ,
PDF ,
Definitive
Cloud providers possessing large quantities of spare capacity must
either incentivize clients to purchase it or suffer losses. Amazon
is the first cloud provider to address this challenge, by allowing
clients to bid on spare capacity and by granting resources to
bidders while their bids exceed a periodically changing spot
price. Amazon publicizes the spot price but does not disclose how
it is determined. By analyzing the spot price histories of
Amazon's EC2 cloud, we reverse engineer how prices are set and
construct a model that generates prices consistent with existing
price traces. Our findings suggest that usually prices are not
market-driven, as sometimes previously assumed. Rather, they are
likely to be generated most of the time at random from within a
tight price range via a dynamic hidden reserve price mechanism.
Our model could help clients make informed bids, cloud providers
design profitable systems, and researchers design pricing
algorithms.
@Article{ben-yehuda13-spotj,
  author  = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
  title   = {Deconstructing {Amazon} {EC2} spot instance pricing},
  journal = {ACM Transactions on Economics and Computation (TEAC)},
  volume  = 1,
  number  = 3,
  year    = 2013,
  month   = {September},
  pages   = {16:1--16:20}
}
-
The nonkernel: a kernel designed for the cloud
Muli Ben-Yehuda, Omer Peleg, Orna Agmon Ben-Yehuda, Igor
Smolyar, Dan Tsafrir
APSYS '13: ACM Asia-Pacific Workshop on Systems
July,
2013,
Singapore,
pages 4:1–4:7
Abstract ,
BibTeX ,
PDF ,
Definitive
Infrastructure-as-a-Service (IaaS) cloud computing is causing a
fundamental shift in the way computing resources are bought,
sold, and used. We foresee a future whereby every CPU cycle,
every memory word, and every byte of network bandwidth in the
cloud would have a constantly changing market-driven price. We
argue that, in such an environment, the underlying resources
should be exposed directly to applications without kernel or
hypervisor involvement. We propose the nonkernel, an
architecture for operating system kernel construction designed
for such cloud computing platforms. A nonkernel uses modern
architectural support for machine virtualization to securely
provide unprivileged user programs with pervasive access to the
underlying resources. We motivate the need for the nonkernel, we
contrast it against its predecessor the exokernel, and we outline
how one could go about building a nonkernel operating system.
@InProceedings{ben-yehuda13-nomish,
  author    = {Muli Ben-Yehuda and Omer Peleg and Orna {Agmon Ben-Yehuda} and Igor Smolyar and Dan Tsafrir},
  title     = {The nonkernel: a kernel designed for the cloud},
  booktitle = {ACM Asia-Pacific Workshop on Systems (APSYS)},
  year      = 2013,
  month     = {July},
  pages     = {4:1--4:7},
  address   = {Singapore}
}
-
Using disk add-ons to withstand simultaneous disk failures
with fewer replicas
Eitan Rosenfeld, Nadav Amit, Dan Tsafrir
WIVOSCA '13: Workshop on the Interaction amongst Virtualization,
Operating Systems and Computer Architecture
June,
2013,
Tel Aviv, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
Contemporary storage systems that utilize replication often
maintain more than two replicas of each data item, reducing the
risk of permanent data loss due to simultaneous disk
failures. The price of the additional copies is smaller usable
storage space, increased network traffic, and higher power
consumption. We propose to alleviate this problem with SimFail, a
storage system that maintains only two replicas and utilizes
per-disk “add-ons”, which are simple hardware devices
equipped with relatively small memory that proxy disk I/O
traffic. SimFail can significantly reduce the risk of data loss
due to temporally adjacent disk failures by quickly copying
at-risk data from disks to their add-ons. SimFail can further
eliminate the risk entirely by maintaining local parity
information of disks on their add-ons (such that each add-on
holds the parity of its own disk's data chunks). We postulate
that SimFail may open the door for cloud providers to reduce the
number of data replicas they use from three to two.
@InProceedings{rosenfeld13-simfail,
  author    = {Eitan Rosenfeld and Nadav Amit and Dan Tsafrir},
  title     = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
  booktitle = {Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA)},
  year      = 2013,
  month     = {June},
  address   = {Tel Aviv, Israel}
}
-
Using disk add-ons to withstand simultaneous disk failures with fewer replicas
Dan Tsafrir, Eitan Rosenfeld
Patent Number: US 9535802 B2 (granted: Jan 2017)
January,
2013
BibTeX ,
Html
@Misc{tsafrir13-raidppat,
  author       = {Dan Tsafrir and Eitan Rosenfeld},
  title        = {Using disk add-ons to withstand simultaneous disk failures with fewer replicas},
  howpublished = {Patent Number: US 9535802 B2 (granted: Jan 2017)},
  year         = 2013,
  month        = {January}
}
2012
-
The resource-as-a-service (RaaS) cloud
Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster,
Dan Tsafrir
HotCloud '12: USENIX Workshop on Hot Topics in Cloud Computing
June,
2012,
Boston, MA
Abstract ,
BibTeX ,
PDF ,
Definitive
Over the next few years, a new model of buying and selling cloud
computing resources will evolve. Instead of providers exclusively
selling server equivalent virtual machines for relatively long
periods of time (as done in today's IaaS clouds), providers will
increasingly sell individual resources (such as CPU, memory, and
I/O resources) for a few seconds at a time. We term this nascent
economic model of cloud computing the Resource-as-a-Service (RaaS)
cloud, and we argue that its rise is the likely culmination of
recent trends in the construction of IaaS clouds and of the
economic forces operating on both providers and clients.
@InProceedings{ben-yehuda12-raas,
  author    = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
  title     = {The resource-as-a-service {(RaaS)} cloud},
  booktitle = {USENIX Workshop on Hot Topics in Cloud Computing (HotCloud)},
  year      = 2012,
  month     = {June},
  address   = {Boston, MA}
}
-
Feasibility of mutable replay for automated regression
testing of security updates
Ilia Kravets, Dan Tsafrir
RESoLVE '12: Workshop on Runtime Environments, Systems, Layering and Virtualized
Environments
March,
2012,
London, UK
Abstract ,
BibTeX ,
PDF ,
Definitive
Applying software patches runs the risk of somehow breaking the
system, so organizations often prefer to avoid updates or to
defer them until extensive testing is performed. But unpatched
software might be compromised, and extensive testing might be
tedious as well as time- and resource-consuming. We consider alleviating
the problem by utilizing a new type of “mutable” deterministic
execution replay, which would allow us to (1) run the software
before and after the patch and to (2) compare the behavior of the
two versions in a manner that tolerates legitimate differences
that arise from security patches. The potential advantages of
this approach are that: it is automatic; it requires no
programming skills; it is language-agnostic; it works without
source code; and it can be applied while the system is running
and serving real production workloads. In this work, we do not
implement mutable replay; rather, we assess its
feasibility. We do so by systematically studying about 200
security patches applied to the most popular Debian/Linux
packages and by considering how mutable replay should be
implemented so as to tolerate execution differences that such
patches generate.
@InProceedings{kravets12-mreplay,
  author    = {Ilia Kravets and Dan Tsafrir},
  title     = {Feasibility of mutable replay for automated regression testing of security updates},
  booktitle = {Workshop on Runtime Environments, Systems, Layering and Virtualized Environments (RESoLVE)},
  year      = 2012,
  month     = {March},
  address   = {London, UK}
}
-
ELI: bare-metal performance for I/O virtualization
Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda,
Alex Landau, Assaf Schuster, Dan Tsafrir
ASPLOS '12: ACM International Conference on Architectural Support for Programming Languages and
Operating Systems
March,
2012,
London, UK,
pages 411–422,
Pat Goldberg Memorial Best Paper Award,
HiPEAC Paper Award,
CACM Research Highlight
Abstract ,
BibTeX ,
PDF ,
Definitive
Direct device assignment enhances the performance of guest virtual
machines by allowing them to communicate with I/O devices without
host involvement. But even with device assignment, guests are
still unable to approach bare-metal performance, because the host
intercepts all interrupts, including those interrupts generated by
assigned devices to signal to guests the completion of their I/O
requests. The host involvement induces multiple unwarranted
guest/host context switches, which significantly hamper the
performance of I/O intensive workloads. To solve this problem, we
present ELI (ExitLess Interrupts), a software-only approach for
handling interrupts within guest virtual machines directly
and securely. By removing the host from the interrupt
handling path, ELI manages to improve the throughput and latency
of unmodified, untrusted guests by 1.3x–1.6x, allowing them to
reach 97%–100% of bare-metal performance even for the most
demanding I/O-intensive workloads.
@InProceedings{gordon12-eli,
  author    = {Abel Gordon and Nadav Amit and Nadav Har'El and Muli Ben-Yehuda and Alex Landau and Assaf Schuster and Dan Tsafrir},
  title     = {{ELI}: bare-metal performance for {I/O} virtualization},
  booktitle = {ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
  year      = 2012,
  month     = {March},
  pages     = {411--422},
  address   = {London, UK}
}
2011
-
Deconstructing Amazon EC2 spot instance pricing
Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster,
Dan Tsafrir
CloudCom '11: IEEE International Conference on Cloud Computing Technology
and Science
November,
2011,
Athens, Greece
Abstract ,
BibTeX ,
PDF ,
Definitive
Cloud providers possessing large quantities of spare capacity must
either incentivize clients to purchase it or suffer losses. Amazon
is the first cloud provider to address this challenge, by allowing
clients to bid on spare capacity and by granting resources to
bidders while their bids exceed a periodically changing spot
price. Amazon publicizes the spot price but does not disclose how
it is determined. By analyzing the spot price histories of
Amazon's EC2 cloud, we reverse engineer how prices are set and
construct a model that generates prices consistent with existing
price traces. We find that prices are usually not market-driven as
sometimes previously assumed. Rather, they are typically generated
at random from within a tight price interval via a dynamic hidden
reserve price. Our model could help clients make informed bids,
cloud providers design profitable systems, and researchers design
pricing algorithms.
@InProceedings{ben-yehuda11-spotprice,
  author    = {Orna {Agmon Ben-Yehuda} and Muli Ben-Yehuda and Assaf Schuster and Dan Tsafrir},
  title     = {Deconstructing {Amazon} {EC2} spot instance pricing},
  booktitle = {IEEE International Conference on Cloud Computing Technology and Science (CloudCom)},
  year      = 2011,
  month     = {November},
  address   = {Athens, Greece}
}
-
vIOMMU: efficient IOMMU emulation
Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, Assaf Schuster
ATC '11: USENIX Annual Technical Conference
June,
2011,
Portland, Oregon,
pages 73–86
Abstract ,
BibTeX ,
PDF ,
Definitive
Direct device assignment, where a guest virtual machine directly
interacts with an I/O device without host intervention, is
appealing, because it allows an unmodified (non-hypervisor-aware)
guest to achieve near-native performance. But device assignment
for unmodified guests suffers from two serious deficiencies: (1)
it requires pinning of all the guest's pages, thereby disallowing
memory overcommitment, and (2) it exposes the guest's memory to
buggy device drivers.
We solve these problems by designing, implementing, and exposing
an emulated IOMMU (vIOMMU) to the unmodified guest. We employ two
novel optimizations to make vIOMMU perform well: (1) waiting a few
milliseconds before tearing down an IOMMU mapping in the hope it
will be immediately reused (“optimistic teardown”), and (2)
running the vIOMMU on a sidecore, and thereby enabling for the
first time the use of a sidecore by unmodified guests. Both
optimizations are highly effective in isolation. The former allows
bare-metal to achieve 100% throughput; the combination
of the two allows an unmodified guest to do the same.
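The “optimistic teardown” optimization can be sketched as a delayed unmap (the names and the timer plumbing here are hypothetical): a teardown request only records a deadline, a reuse of the same mapping before the deadline cancels the teardown, and only expired entries are actually unmapped.

#include <chrono>
#include <cstdint>
#include <unordered_map>

struct DeferredUnmap {
    using Clock = std::chrono::steady_clock;
    std::chrono::milliseconds grace{10};                      // wait a few ms
    std::unordered_map<uint64_t, Clock::time_point> pending;  // iova -> deadline

    // Instead of unmapping now, record a deadline.
    void request_teardown(uint64_t iova) { pending[iova] = Clock::now() + grace; }

    // A new mapping request for the same IOVA cancels the pending teardown.
    bool try_reuse(uint64_t iova) { return pending.erase(iova) > 0; }

    // Invoked periodically: unmap only entries whose grace period expired.
    template <class Unmap>
    void expire(Unmap unmap) {
        auto now = Clock::now();
        for (auto it = pending.begin(); it != pending.end(); )
            if (it->second <= now) { unmap(it->first); it = pending.erase(it); }
            else ++it;
    }
};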
@InProceedings{amit11-viommu,
  author    = {Nadav Amit and Muli Ben-Yehuda and Dan Tsafrir and Assaf Schuster},
  title     = {{vIOMMU}: efficient {IOMMU} emulation},
  booktitle = {USENIX Annual Technical Conference (ATC)},
  year      = 2011,
  month     = {June},
  pages     = {73--86},
  address   = {Portland, Oregon}
}
-
A method for guaranteeing program correctness using
finegrained hardware speculative execution
Dan Tsafrir, Robert W. Wisniewski
Patent Number: US 9195550 B2 (granted: Nov 2015)
February,
2011
BibTeX ,
Html
@Misc{tsafrir11-specupat,
  author       = {Dan Tsafrir and Robert W. Wisniewski},
  title        = {A method for guaranteeing program correctness using finegrained hardware speculative execution},
  howpublished = {Patent Number: US 9195550 B2 (granted: Nov 2015)},
  year         = 2011,
  month        = {February}
}
2010
-
Using inaccurate estimates accurately
Dan Tsafrir
JSSPP '10: Workshop on Job Scheduling Strategies for Parallel
Processing
April,
2010,
Atlanta, Georgia,
pages 208–221,
LNCS Volume 6253,
keynote talk
Abstract ,
BibTeX ,
PDF ,
Definitive
Job schedulers improve the system utilization by requiring users
to estimate how long their jobs will run and by using this
information to better pack (or “backfill”) the jobs. But,
surprisingly, many studies find that deliberately making estimates
less accurate boosts (or does not affect) the performance,
which helps explain why production systems still exclusively rely
on notoriously inaccurate estimates.
We prove these studies wrong by showing their methodology is
erroneous. The studies model an estimate e as being correlated
with r·F (where r is the runtime of the associated job,
F is some "badness" factor, and bigger Fs imply increased
inaccuracy). We show this model is invalid, because: (1) it
conveys too much information to the scheduler; (2) it induces
favoritism of short jobs; and (3) it is inherently different than
real user inaccuracy, which associates 90% of the jobs with
merely 20 estimate values, hindering the scheduler's ability to
backfill.
We conclude that researchers must stop using multiples of runtimes
as estimates, or else their results would likely be invalid. We
develop (and propose to use) a realistic model that preserves the
estimates' modality and allows one to soundly simulate increased
inaccuracy by, e.g., associating more jobs with the maximal
runtime allowed (an always-popular estimate, which prevents
backfilling).
@InCollection{tsafrir10-keynote,
  author    = {Dan Tsafrir},
  title     = {Using inaccurate estimates accurately},
  booktitle = {Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)},
  year      = 2010,
  month     = {April},
  pages     = {208--221},
  address   = {Atlanta, Georgia},
  note      = {LNCS Volume 6253}
}
2009
-
SCARY iterator assignment and initialization revision 1
Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael
Wong
ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170
October,
2009,
Santa Cruz, California
Abstract ,
BibTeX ,
PDF ,
Definitive
We propose a requirement that a standard container's iterator
types have no dependency on any type argument apart from the
container's value_type, difference_type,
pointer type, and const_pointer type. In
particular, the types of a standard container's iterators should
not depend on the container's key_compare,
hasher, key_equal, or allocator types.
@InProceedings{klarer09-scary-rev1,
  author    = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
  title     = {SCARY iterator assignment and initialization revision 1},
  booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2980=09-0170},
  year      = 2009,
  month     = {October},
  address   = {Santa Cruz, California}
}
-
Minimizing dependencies within generic classes for faster
and smaller programs
Dan Tsafrir, Robert W. Wisniewski, David F. Bacon,
Bjarne Stroustrup
OOPSLA '09: ACM SIGPLAN Conference on Object-Oriented
Programming Systems, Languages, and Applications
October,
2009,
Orlando, Florida,
pages 425–444
Abstract ,
BibTeX ,
PDF ,
Definitive
Generic classes can be used to improve performance by allowing
compile-time polymorphism. But the applicability of compile-time
polymorphism is narrower than that of runtime polymorphism, and it
might bloat the object code. We advocate a programming principle
whereby a generic class should be implemented in a way that
minimizes the dependencies between its members (nested types,
methods) and its generic type parameters. Conforming to this
principle (1) reduces the bloat and (2) gives rise to a previously
unconceived manner of using the language that expands the
applicability of compile-time polymorphism to a wider range of
problems. Our contribution is thus a programming technique that
generates faster and smaller programs. We apply our ideas to
GCC's STL containers and iterators, and we demonstrate notable
speedups and reduction in object code size (real application runs
1.2x to 2.1x faster and STL code is 1x to 25x smaller). We
conclude that standard generic APIs (like STL) should be amended
to reflect the proposed principle in the interest of efficiency
and compactness. Such modifications will not break old code; they
simply increase flexibility. Our findings apply to languages like
C++, C#, and D, which realize generic programming through multiple
instantiations.
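The principle can be illustrated with a container whose iterator is hoisted into a base that depends only on the value type (a simplified sketch in the spirit of the paper, not GCC's STL code): two instantiations that differ only in their comparator then share a single iterator type, and hence a single copy of its object code.

#include <functional>
#include <type_traits>

// The iterator lives in a base that depends only on T, so instantiations of
// set<T, Cmp> that differ just in Cmp share one iterator type.
template <class T>
struct basic_iterator { T* p = nullptr; T& operator*() const { return *p; } };

template <class T, class Cmp = std::less<T>>
class set {
public:
    using iterator = basic_iterator<T>;  // independent of Cmp
    // ... insertion/lookup elided ...
};

static_assert(std::is_same<set<int, std::less<int>>::iterator,
                           set<int, std::greater<int>>::iterator>::value,
              "iterator type must not depend on the comparator");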
@InProceedings{tsafrir09-inner,
  author    = {Dan Tsafrir and Robert W. Wisniewski and David F. Bacon and Bjarne Stroustrup},
  title     = {Minimizing dependencies within generic classes for faster and smaller programs},
  booktitle = {ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA)},
  year      = 2009,
  month     = {October},
  pages     = {425--444},
  address   = {Orlando, Florida}
}
-
SCARY iterator assignment and initialization
Robert Klarer, Bjarne Stroustrup, Dan Tsafrir, Michael
Wong
ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103
July,
2009,
Frankfurt, Germany
Abstract ,
BibTeX ,
PDF ,
Definitive
We propose a requirement that a standard container's iterator
types have no dependency on any type argument apart from the
container's value_type, difference_type,
pointer type, and const_pointer type. In
particular, the types of a standard container's iterators should
not depend on the container's key_compare,
hasher, key_equal, or allocator types.
@InProceedings{klarer09-scary,
  author    = {Robert Klarer and Bjarne Stroustrup and Dan Tsafrir and Michael Wong},
  title     = {SCARY iterator assignment and initialization},
  booktitle = {ISO/IEC C++ Standards Committee Paper WG21/N2913=09-0103},
  year      = 2009,
  month     = {July},
  address   = {Frankfurt, Germany}
}
2008
-
Portably solving file races with hardness amplification
Dan Tsafrir, Tomer Hertz, Dilma Da Silva, David Wagner
TOS: ACM Transactions on Storage
November,
2008,
page 9,
volume 4,
number 3
Superseded by ,
Abstract ,
BibTeX ,
PDF ,
Definitive
The file-system API of contemporary systems makes programs vulnerable
to TOCTTOU (time-of-check-to-time-of-use) race conditions.
Existing solutions either help users to detect these problems (by
pinpointing their locations in the code), or prevent the problem
altogether (by modifying the kernel or its API).
But the latter alternative is not prevalent, and the former is just the
first step: programmers must still address TOCTTOU flaws within the
limits of the existing API with which several important tasks cannot
be accomplished in a portable straightforward manner.
Recently, Dean and Hu [2004] addressed this problem and suggested a
probabilistic hardness amplification approach that alleviated the
matter.
Alas, shortly after, Borisov et al. [2005] responded
with an attack termed
“filesystem maze” that defeated the new approach.
We begin by noting that mazes constitute a generic way to
deterministically win many TOCTTOU races (gone are the days when the
probability was small).
In the face of this threat, we (1) develop a new user-level defense
that can withstand mazes, and (2) show that our method is undefeated
even by much stronger hypothetical attacks that provide the adversary
program with ideal conditions to win the race (enjoying complete and
instantaneous knowledge about the defending program's actions and
being able to perfectly synchronize accordingly).
The fact that our approach is immune to these unrealistic attacks suggests
it can be used as a simple and portable solution to a large class of
TOCTTOU vulnerabilities, without requiring modifications to the
underlying operating system.
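For readers unfamiliar with the setting, the flavor of a hardness-amplified defense looks roughly as follows (an illustrative sketch only; the paper's actual algorithm and analysis are more involved): rather than a single check/use pair, open first and then re-verify, several times, that the name still denotes the very file the descriptor refers to.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Instead of the classic vulnerable access()-then-open() pair, open first and
// then recheck the name/descriptor binding K times; each extra round makes a
// lucky race win harder for the adversary.
int open_checked(const char* path, int rounds) {
    int fd = open(path, O_RDONLY | O_NOFOLLOW);
    if (fd < 0) return -1;
    struct stat by_fd, by_name;
    if (fstat(fd, &by_fd) != 0) { close(fd); return -1; }
    for (int i = 0; i < rounds; i++) {
        if (lstat(path, &by_name) != 0 ||
            by_name.st_dev != by_fd.st_dev || by_name.st_ino != by_fd.st_ino) {
            close(fd);
            return -1;  // the name was switched under us
        }
    }
    return fd;
}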
@Article{tsafrir08-tocttou:tos,
  author  = {Dan Tsafrir and Tomer Hertz and Dilma Da Silva and David Wagner},
  title   = {Portably solving file races with hardness amplification},
  journal = {ACM Transactions on Storage (TOS)},
  volume  = 4,
  number  = 3,
  year    = 2008,
  month   = {November},
  pages   = 9
}
-
Portably preventing file race attacks
with user-mode path resolution
Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
Technical Report RC24572, IBM T. J. Watson Research Center
June,
2008,
Yorktown Heights, New York,
submitted
Abstract ,
BibTeX ,
PDF ,
Definitive
The filesystem API of contemporary systems exposes programs to
TOCTTOU (time of check to time of use) race-condition
vulnerabilities, which occur between pairs of check/use system
calls that involve a name of a file. Existing solutions either
help programmers to detect such races (by pinpointing their
location) or prevent them altogether (by altering the operating
system). But the latter alternative is not prevalent, and the
former is just the first step: programmers must still address
TOCTTOU flaws within the limits of the existing API with which
several important tasks cannot be safely accomplished in a
portable straightforward manner. The recent “filesystem maze”
attack further worsens the problem by allowing adversaries to
deterministically win races and thus refuting the common
perception that the risk is small. In the face of this threat, we
develop a new algorithm that allows programmers to effectively
aggregate a vulnerable pair of distinct system calls into a single
operation that is executed “atomically”. This is achieved by
emulating one kernel functionality in user mode: the filepath
resolution. The surprisingly simple resulting algorithm
constitutes a portable solution to a large class of TOCTTOU
vulnerabilities, without requiring modifications to the underlying
operating system. In contrast to our previous work, the new
algorithm is deterministic and incurs significantly less overhead.
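The core of user-mode path resolution can be sketched with openat(): walk the path one component at a time, refusing to follow symbolic links, so each lookup becomes a single atomic step (a simplified illustration; the actual algorithm handles symlinks, "..", and error cases far more carefully).

#include <fcntl.h>
#include <unistd.h>
#include <sstream>
#include <string>

// Resolve "a/b/c" relative to dirfd, one component at a time. O_NOFOLLOW
// ensures an adversary cannot splice a symlink into the path mid-resolution.
int resolve_nofollow(int dirfd, const std::string& path) {
    std::istringstream ss(path);
    std::string comp;
    int cur = dup(dirfd);
    while (cur >= 0 && std::getline(ss, comp, '/')) {
        if (comp.empty()) continue;
        int next = openat(cur, comp.c_str(), O_RDONLY | O_NOFOLLOW | O_DIRECTORY);
        if (next < 0 && ss.eof())        // last component: may be a plain file
            next = openat(cur, comp.c_str(), O_RDONLY | O_NOFOLLOW);
        close(cur);
        cur = next;
    }
    return cur;  // open descriptor for the final component, or -1
}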
@TechReport{tsafrir08-path-res,
  author      = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
  title       = {Portably preventing file race attacks with user-mode path resolution},
  institution = {IBM T. J. Watson Research Center},
  number      = {RC24572},
  year        = 2008,
  month       = {June},
  address     = {Yorktown Heights, New York},
  note        = {(submitted for publication)}
}
-
The murky issue of changing process identity: revising “setuid demystified”
Dan Tsafrir, Dilma Da Silva, David Wagner
USENIX ;login
June,
2008,
pages 55–66,
volume 33,
number 3
Abstract ,
BibTeX ,
PDF ,
Definitive ,
Software
Dropping unneeded process privileges promotes security, but is
notoriously error-prone due to confusing set*id system
calls with unclear semantics and subtle portability issues.
To make things worse, existing recipes to accomplish the task are
lacking, related manuals can be misleading, and the
associated kernel subsystem might contain bugs.
We therefore proclaim the system as untrustworthy when it comes to the
subject matter and suggest a defensive easy-to-use solution that
addresses all concerns.
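In the defensive spirit the article advocates, a privilege drop should be verified rather than trusted (a minimal POSIX/Linux sketch, not the article's full recipe):

#define _GNU_SOURCE       // for setresuid/getresuid on glibc
#include <stdlib.h>
#include <unistd.h>

// Drop to the given uid and verify that all three user IDs (real, effective,
// saved) actually changed; abort on any silent failure.
void drop_privileges_or_die(uid_t uid) {
    if (setresuid(uid, uid, uid) != 0) abort();
    uid_t r, e, s;
    if (getresuid(&r, &e, &s) != 0) abort();
    if (r != uid || e != uid || s != uid) abort();  // drop did not take effect
}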
@Article{tsafrir08-priv,
  author  = {Dan Tsafrir and Dilma Da Silva and David Wagner},
  title   = {The murky issue of changing process identity: revising ``setuid demystified''},
  journal = {USENIX ;login},
  volume  = 33,
  number  = 3,
  year    = 2008,
  month   = {June},
  pages   = {55--66}
}
-
Portably solving file TOCTTOU races with hardness amplification
Dan Tsafrir, Tomer Hertz, David Wagner, Dilma Da Silva
FAST '08: USENIX Conference on File and Storage Technologies
February,
2008,
San Jose, California,
pages 189–206,
awarded Best Paper,
Pat Goldberg Memorial Best Paper Award
Superseded by ,
Abstract ,
BibTeX ,
PDF ,
Definitive
The file-system API of contemporary systems makes programs
vulnerable to TOCTTOU (time of check to time of use) race
conditions. Existing solutions either help users to detect these
problems (by pinpointing their locations in the code), or prevent
the problem altogether (by modifying the kernel or its API). The
latter alternative is not prevalent, and the former is just the
first step: programmers must still address TOCTTOU flaws within
the limits of the existing API with which several important tasks
cannot be accomplished in a portable straightforward manner.
Recently, Dean and Hu addressed this problem and suggested a
probabilistic hardness amplification approach that alleviated the
matter. Alas, shortly after, Borisov et al. responded with an
attack termed “filesystem maze” that defeated the new approach.
We begin by noting that mazes constitute a generic way to
deterministically win many TOCTTOU races (gone are the days when
the probability was small). In the face of this threat, we (1)
develop a new user-level defense that can withstand mazes, and (2)
show that our method is undefeated even by much stronger
hypothetical attacks that provide the adversary program with ideal
conditions to win the race (enjoying complete and instantaneous
knowledge about the defending program's actions and being able to
perfectly synchronize accordingly). The fact that our approach is
immune to these unrealistic attacks suggests it can be used as a
simple and portable solution to a large class of TOCTTOU
vulnerabilities, without requiring modifications to the underlying
operating system.
@InProceedings{tsafrir08-tocttou,
  author    = {Dan Tsafrir and Tomer Hertz and David Wagner and Dilma Da Silva},
  title     = {Portably solving file {TOCTTOU} races with hardness amplification},
  booktitle = {USENIX Conference on File and Storage Technologies (FAST)},
  year      = 2008,
  month     = {February},
  pages     = {189--206},
  address   = {San Jose, California}
}
-
Specialized execution environments
Maria Butrico, Dilma Da Silva, Orran Krieger, Michal Ostrowski,
Bryan Rosenburg, Dan Tsafrir, Eric Van Hensbergen, Robert W. Wisniewski,
Jimi Xenidis
OSR: ACM Operating Systems Review
January,
2008,
pages 106–107,
volume 42,
number 1
BibTeX ,
PDF ,
Definitive
@Article{butrico08-specialized,
  author  = {Maria Butrico and Dilma Da Silva and Orran Krieger and Michal Ostrowski and Bryan Rosenburg and Dan Tsafrir and Eric Van Hensbergen and Robert W. Wisniewski and Jimi Xenidis},
  title   = {Specialized execution environments},
  journal = {ACM Operating Systems Review (OSR)},
  volume  = 42,
  number  = 1,
  year    = 2008,
  month   = {January},
  pages   = {106--107}
}
2007
-
Reducing performance evaluation sensitivity and variability
by input shaking
Dan Tsafrir, Keren Ouaknine, Dror G. Feitelson
MASCOTS '07: IEEE International Symposium on Modeling, Analysis, and
Simulation of Computer and Telecommunication Systems
October,
2007,
Istanbul, Turkey,
pages 231–237
Abstract ,
BibTeX ,
PDF ,
Definitive
Simulations sometimes lead to observed sensitivity to
configuration parameters as well as inconsistent performance
results. The question is then what is the true effect and what is
a coincidental artifact of the evaluation. The shaking methodology
answers this by executing multiple simulations under small
perturbations to the input workload, and calculating the average
performance result; if the effect persists we can be more
confident that it is real, whereas if it disappears it was an
artifact. We present several examples where the sensitivity that
appears in results based on a single evaluation is eliminated or
considerably reduced by the shaking methodology. While our
examples come from evaluations of scheduling algorithms for
supercomputers, we believe the method has wider applicability.
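Schematically, shaking amounts to the loop below (the toy FCFS simulator is a hypothetical stand-in for whatever evaluation is being shaken; the real methodology defines the perturbations more carefully):

#include <random>
#include <vector>

struct Job { double arrival, runtime; };  // jobs assumed sorted by arrival

// Toy single-server FCFS simulator returning the average wait time.
double simulate(const std::vector<Job>& w) {
    double t = 0, wait = 0;
    for (const Job& j : w) {
        if (t < j.arrival) t = j.arrival;
        wait += t - j.arrival;
        t += j.runtime;
    }
    return w.empty() ? 0 : wait / w.size();
}

// Shaking: rerun under small random perturbations of the arrival times and
// average the results, so one-off artifacts wash out.
double shaken_result(const std::vector<Job>& workload, int runs, double jitter) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> d(-jitter, jitter);
    double sum = 0;
    for (int i = 0; i < runs; i++) {
        std::vector<Job> w = workload;
        for (Job& j : w) j.arrival += d(rng);  // (re-sorting omitted for brevity)
        sum += simulate(w);
    }
    return sum / runs;
}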
@InProceedings{tsafrir07-shake,
  author    = {Dan Tsafrir and Keren Ouaknine and Dror G. Feitelson},
  title     = {Reducing performance evaluation sensitivity and variability by input shaking},
  booktitle = {IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)},
  year      = 2007,
  month     = {October},
  pages     = {231--237},
  address   = {Istanbul, Turkey}
}
-
Secretly monopolizing the CPU without superuser privileges
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
USENIX Security Symposium
August,
2007,
Boston, Massachusetts,
pages 239–256
Abstract ,
BibTeX ,
PDF ,
Definitive
, Coverage:
Slashdot ,
Ars Technica ,
ACM TechNews
We describe a “cheat” attack, allowing an ordinary process to
hijack any desirable percentage of the CPU cycles without
requiring superuser/administrator privileges. Moreover, the
nature of the attack is such that, at least in some systems,
listing the active processes will erroneously show the cheating
process as not using any CPU resources: the “missing” cycles
would either be attributed to some other process or not be
reported at all (if the machine is otherwise idle). Thus, certain
malicious operations generally believed to have required
overcoming the hardships of obtaining root access and installing a
rootkit, can actually be launched by non-privileged users in a
straightforward manner, thereby making the job of a malicious
adversary that much easier. We show that most major
general-purpose operating systems are vulnerable to the cheat
attack, due to a combination of how they account for CPU usage and
how they use this information to prioritize competing processes.
Furthermore, recent scheduler changes attempting to better support
interactive workloads increase the vulnerability to the attack,
and naive steps taken by certain systems to reduce the danger are
easily circumvented. We show that the attack can nevertheless be
defeated, and we demonstrate this by implementing a patch for
Linux that eliminates the problem with negligible
overhead.
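The essence of the attack can be sketched as follows (schematic only, assuming a 1000 Hz tick and sampling-based accounting; a real exploit must calibrate against the system's actual clock): compute for most of each tick, but always sleep across the tick itself, so the periodic accounting sample never catches the process running.

#include <time.h>

// Schematic "cheat" loop for a 1 ms tick: busy-work for ~90% of the tick,
// then sleep over the tick boundary so the sampler sees us blocked.
static void cheat_loop() {
    const long tick_ns = 1000000;  // 1 ms
    for (;;) {
        timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        long phase0 = t.tv_nsec % tick_ns;
        long elapsed;
        do {  // busy-work placeholder: spin until ~90% of the tick has passed
            clock_gettime(CLOCK_MONOTONIC, &t);
            elapsed = (t.tv_nsec % tick_ns + tick_ns - phase0) % tick_ns;
        } while (elapsed < tick_ns * 9 / 10);
        timespec nap = {0, tick_ns / 5};  // sleep across the tick boundary
        nanosleep(&nap, nullptr);
    }
}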
@InProceedings{tsafrir07-cheat,
  author    = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
  title     = {Secretly monopolizing the {CPU} without superuser privileges},
  booktitle = {USENIX Security Symposium},
  year      = 2007,
  month     = {August},
  pages     = {239--256},
  address   = {Boston, Massachusetts}
}
-
The context-switch overhead inflicted by hardware
interrupts (and the enigma of do-nothing loops)
Dan Tsafrir
ExpCS '07: ACM Workshop on Experimental Computer Science
June,
2007,
San Diego, California,
page 4
Abstract ,
BibTeX ,
PDF ,
Definitive
The overhead of a context switch is typically associated with
multitasking, where several applications share a processor. But
even if only one runnable application is present in the system and
supposedly runs alone, it is still repeatedly preempted in favor
of a different thread of execution, namely, the operating system
that services periodic clock interrupts. We employ two
complementing methodologies to measure the overhead incurred by
such events and obtain contradictory results.
The first methodology systematically changes the interrupt
frequency and measures by how much this prolongs the duration of a
program that sorts an array. The overall overhead is found to be
0.5-1.5% at 1000 Hz, linearly proportional to the tick rate, and
steadily declining as the speed of processors increases. If the
kernel is configured such that each tick is slowed down by an
access to an external time source, then the direct overhead
dominates. Otherwise, the relative weight of the indirect
portion is steadily growing with processors' speed, accounting for
up to 85% of the total.
The second methodology repeatedly executes a simplistic loop
(calibrated to take 1ms), measures the actual execution time, and
analyzes the perturbations. Some loop implementations yield
results similar to the above, but others indicate that the
overhead is actually an order of magnitude bigger, or worse. The
phenomenon was observed on IA32, IA64, and Power processors, the
latter being part of the ASC Purple supercomputer.
Importantly, the effect is greatly amplified for parallel jobs,
where one late thread holds up all its peers, causing a slowdown
that is dominated by the per-node latency (numerator) and the job
granularity (denominator). We trace the bizarre effect to an
unexplained interrupt/loop interaction; the question of whether
this hardware misfeature is experienced by real applications
remains open.
@InProceedings{tsafrir07-context,
  author    = {Dan Tsafrir},
  title     = {The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops)},
  booktitle = {ACM Workshop on Experimental Computer Science (ExpCS)},
  year      = 2007,
  month     = {June},
  pages     = 4,
  address   = {San Diego, California}
}
-
Backfilling using system-generated predictions rather than
user runtime estimates
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
TPDS: IEEE Transactions on Parallel and Distributed Systems
June,
2007,
pages 789–803,
volume 18,
number 6
Abstract ,
BibTeX ,
PDF ,
Definitive
The most commonly used scheduling algorithm for parallel
supercomputers is FCFS with backfilling, as originally introduced
in the EASY scheduler. Backfilling means that short jobs are
allowed to run ahead of their time provided they do not delay
previously queued jobs (or at least the first queued job). To
make such determinations possible, users are required to provide
estimates of how long jobs will run, and jobs that violate these
estimates are killed. Empirical studies have repeatedly shown
that user estimates are inaccurate, and that system-generated
predictions based on history may be significantly better.
However, predictions have not been incorporated into production
schedulers, partially due to a misconception (that we resolve)
claiming inaccuracy actually improves performance, but mainly
because underprediction is technically unacceptable: users will
not tolerate jobs being killed just because system predictions
were too short. We solve this problem by divorcing kill-time from
the runtime prediction, and correcting predictions adaptively as
needed if they are proved wrong. The end result is a surprisingly
simple scheduler, which requires minimal deviations from current
practices (e.g. using FCFS as the basis), and behaves exactly
like EASY as far as users are concerned; nevertheless, it achieves
significant improvements in performance, predictability, and
accuracy. Notably, this is based on a very simple runtime
predictor that just averages the runtimes of the last two jobs by
the same user; counterintuitively, our results indicate that using
recent data is more important than mining the history for similar
jobs. All the techniques suggested in this paper can be used to
enhance any backfilling algorithm, and are not limited to EASY.
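The predictor itself is almost trivial; a faithful sketch of the idea (the surrounding scheduler integration is omitted):

#include <deque>
#include <string>
#include <unordered_map>

// Predict a job's runtime as the average of the last two runtimes by the
// same user; fall back to the user's estimate when history is missing.
class RuntimePredictor {
    std::unordered_map<std::string, std::deque<double>> history; // user -> recent runtimes
public:
    double predict(const std::string& user, double user_estimate) const {
        auto it = history.find(user);
        if (it == history.end() || it->second.size() < 2) return user_estimate;
        return (it->second[0] + it->second[1]) / 2.0;
    }
    void record(const std::string& user, double runtime) {
        auto& h = history[user];
        h.push_front(runtime);
        if (h.size() > 2) h.pop_back();   // keep only the last two jobs
    }
};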
@Article{tsafrir07-pred,
  author  = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
  title   = {Backfilling using system-generated predictions rather than user runtime estimates},
  journal = {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
  volume  = 18,
  number  = 6,
  year    = 2007,
  month   = {June},
  pages   = {789--803}
}
-
Fine grained kernel logging with KLogger: experience and
insights
Yoav Etsion, Dan Tsafrir, Scott Kirkpatrick,
Dror G. Feitelson
EuroSys '07: ACM European Conference on Computer Systems
March,
2007,
Lisbon, Portugal,
pages 259–274
Abstract ,
BibTeX ,
PDF ,
Definitive ,
Software
Understanding the detailed behavior of an operating system is
crucial for making informed design decisions. But such an
understanding is very hard to achieve, due to the increasing
complexity of such systems and the fact that they are implemented
and maintained by large and diverse groups of developers. Tools
like KLogger — presented in this paper — can help by enabling
fine-grained logging of system events and the sharing of a logging
infrastructure between multiple developers and researchers,
facilitating a methodology where design evaluation can be an
integral part of kernel development. We demonstrate the need for
such methodology by a host of case studies, using KLogger to
better understand various kernel subsystems in Linux, and
pinpointing overheads and problems therein.
@InProceedings{etsion07-klogger,
  author    = {Yoav Etsion and Dan Tsafrir and Scott Kirkpatrick and Dror G. Feitelson},
  title     = {Fine grained kernel logging with {KLogger:} experience and insights},
  booktitle = {ACM European Conference on Computer Systems (EuroSys)},
  year      = 2007,
  month     = {March},
  pages     = {259--274},
  address   = {Lisbon, Portugal}
}
2006
-
Process prioritization using output production: scheduling
for multimedia
Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
TOMM: ACM Transactions on Multimedia Computing, Communications
and Applications
November,
2006,
pages 318–342,
volume 2,
number 4
Abstract ,
BibTeX ,
PDF ,
Definitive
Desktop operating systems such as Windows and Linux base
scheduling decisions on CPU consumption: processes that consume
fewer CPU cycles are prioritized, assuming that interactive
processes gain from this since they spend most of their time
waiting for user input. However, this doesn't work for modern
multimedia applications which require significant CPU
resources. We therefore suggest a new metric to identify
interactive processes by explicitly measuring interactions with
the user, and we use it to design and implement a process
scheduler. Measurements using a variety of applications indicate
that this scheduler is very effective in distinguishing between
competing interactive and noninteractive processes.
@Article{etsion06-huc,
  author  = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
  title   = {Process prioritization using output production: scheduling for multimedia},
  journal = {ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)},
  volume  = 2,
  number  = 4,
  year    = 2006,
  month   = {November},
  pages   = {318--342}
}
-
Stop polling! The case against OS ticks
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
OSDI '06: USENIX Symposium on Operating Systems Design and
Implementation
November,
2006,
Seattle, Washington,
Poster Session
BibTeX ,
PDF ,
Definitive ,
Poster
@InProceedings{tsafrir06-ticks,
  author    = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
  title     = {Stop polling! {The} case against {OS} ticks},
  booktitle = {USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  year      = 2006,
  month     = {November},
  address   = {Seattle, Washington},
  note      = {Poster Session}
}
-
The dynamics of backfilling: solving the mystery of why
increased inaccuracy may help
Dan Tsafrir, Dror G. Feitelson
IISWC '06: IEEE International Symposium on Workload Characterization
October,
2006,
San Jose, California,
pages 131–141
Abstract ,
BibTeX ,
PDF ,
Definitive
Parallel job scheduling with backfilling requires users to provide
runtime estimates, used by the scheduler to better pack the jobs.
Studies of the impact of such estimates on performance have
modeled them using a “badness factor” f ≥ 0 in an attempt to
capture their inaccuracy (given a runtime r, the estimate is
uniformly distributed in [r, (f+1) · r]). Surprisingly,
inaccurate estimates (f>0) yielded better performance than
accurate ones (f=0). We explain this by a “heel and toe”
dynamics that, with f>0, cause backfilling to approximate
shortest-job first scheduling. We further find the effect of
systematically increasing f is V-shaped: average wait time and
slowdown initially drop, only to rise later on. This happens
because higher fs create bigger “holes” in the schedule
(longer jobs can backfill) and increase the randomness (more long
jobs appear as short), thus overshadowing the initial heel-and-toe
preference for shorter jobs.
The bottom line is that artificial inaccuracy generated by
multiplying (real or perfect) estimates by a factor is (1) just a
scheduling technique that trades off fairness for performance, and
is (2) ill-suited for studying the effect of real
inaccuracy. Real estimates are modal (90% of the jobs use the
same 20 estimates) and bounded by a maximum (usually the most
popular estimate). Therefore, when performing an evaluation,
“increased inaccuracy” should translate to increased modality.
Unlike multiplying, this indeed worsens performance as one would
intuitively expect.
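The two estimate models contrasted above can be stated in a few lines (a sketch only; the realistic model developed in related work is richer than this two-way choice, and the round values below are hypothetical):

#include <random>

// Flawed model: estimate drawn uniformly from [r, (f+1)*r].
double multiplied_estimate(double r, double f, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(r, (f + 1) * r);
    return u(rng);
}

// More realistic flavor: users pick from a handful of round values, capped
// by the system maximum (values in seconds, for illustration).
double modal_estimate(double r, double max_allowed) {
    static const double modes[] = {600, 3600, 7200, 14400, 36000}; // 10m..10h
    for (double m : modes)
        if (m >= r && m <= max_allowed) return m;  // first round value covering r
    return max_allowed;                             // the ever-popular maximum
}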
@InProceedings{tsafrir06-bfdyn,
  author    = {Dan Tsafrir and Dror G. Feitelson},
  title     = {The dynamics of backfilling: solving the mystery of why increased inaccuracy may help},
  booktitle = {IEEE International Symposium on Workload Characterization (IISWC)},
  year      = 2006,
  month     = {October},
  pages     = {131--141},
  address   = {San Jose, California}
}
-
Modeling, evaluating, and improving the performance of
supercomputer scheduling
Dan Tsafrir
PhD: Technical Report 2006-78, School of Computer Science and
Engineering, the Hebrew University
September,
2006,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
The most popular scheduling policy for parallel systems is FCFS
with backfilling (a.k.a. “EASY” scheduling), where short jobs
are allowed to run ahead of their time provided they do not delay
previously queued jobs (or at least the first queued job). This
mandates users to provide estimates of how long jobs will run, and
jobs that violate these estimates are killed so as not to violate
subsequent commitments. The de-facto standard of evaluating the
impact of inaccurate estimates on performance has been to use a
“badness factor” f ≥ 0, such that given a runtime r, the
associated estimate is uniformly distributed in [r, r ·
(f+1)], or is simply r·(f+1). The underlying assumption
was that bigger fs imply worse information.
Surprisingly, inaccurate estimates (f>0) yield better
performance than accurate ones (f=0), a fact that has repeatedly
produced statements like “inaccurate estimates actually improve
performance” or “what the scheduler doesn't know won't hurt
it”, in many independent studies. This has promoted the
perception that estimates are “unimportant”. At the same time,
other studies noted that real user estimates are inaccurate, and
that system-generated predictions based on history can do better.
But predictions were never incorporated into production
schedulers, partially due to the aforementioned perception that
inaccuracy actually helps, partially because suggested predictors
were too complex, and partially because underprediction is
technically unacceptable, as users will not tolerate jobs being
killed just because system predictions were too short. All
attempts to solve the latter technicality yielded algorithms that
are inappropriate for many supercomputing settings (e.g. using
preemption, assuming all jobs are restartable, etcetera).
This work has four major contributions. First, we show
that the “inaccuracy helps” common wisdom is merely an
unwarranted artifact of the erroneous manner in which inaccurate
estimates have been modeled, and that increased accuracy actually
improves performance. Specifically, previously observed
improvements turn out to be due to a “heel and toe” dynamics
that, with f>0, cause backfilling to approximate shortest-job
first scheduling. We show that multiplying estimates by a factor
translates to trading off fairness for performance, and that this
reasoning works regardless of whether the values being multiplied
are actual runtimes (“perfect estimates”) or the flawed
estimates that are supplied by users. We further show that the
more accurate the values we multiply, the better the resulting
performance. Thus, better estimates actually improve performance,
and multiplying is in fact a scheduling policy that
exercises the fairness/performance tradeoff. Regardless,
multiplying is anything but representative of real inaccuracy, as
outlined next.
Our second contribution is developing a more
representative model of estimates that, from now on, will allow
for a valid evaluation of the effect of inaccurate estimates. It
is largely based on noting that human users repeatedly use the
same “round” values (ten minutes, one hour etc.) and on the
invariant that 90% of the jobs use the same 20 estimates.
Importantly, the most popular estimate is typically the maximal
allowed. As a result, the jobs associated with this estimate
cannot be backfilled, and indeed, the more this value is used, the
more EASY resembles plain FCFS. Thus, to artificially increase
the inaccuracy one should e.g. associate more jobs with the
maximum (a realistic manipulation), not multiply by a
greater factor (a bogus boost of performance).
Our third contribution exploits the above understandings
to devise a new scheduler that is able to automatically improve
the quality of estimates and put this into productive use in the
context of EASY, while preserving its attractive simple batch
essence and refraining from any unacceptable assumptions.
Specifically, the problem of underprediction is solved by
divorcing kill-time from the runtime prediction, and correcting
predictions adaptively at runtime as needed, if they are proved
wrong. The result is a surprisingly simple scheduler, which
requires minimal deviations from current practices, and behaves
exactly like EASY as far as users are concerned. Nevertheless, it
achieves significant improvements in performance, predictability,
and accuracy. Notably, this is based on a very simple runtime
predictor that just averages the runtimes of the last two jobs by
the same user; counterintuitively, our results indicate that using
recent data is more important than saving and mining the history
for similar jobs, as was done by previous work. For further
performance enhancements, we propose to exploit the “heel and
toe” understanding: explicitly using a shortest job
backfilled first (SJBF) backfilling order. This directly
leads to performance improvements similar to those previously
attributed to stunts like multiplying estimates. By still
preserving FCFS as the basis, we maintain EASY's appeal and enjoy
both worlds: a fair scheduler that nevertheless backfills
effectively.
Finally, our fourth contribution has broader
applicability that transcends the supercomputing domain. All of
the above results are based on the standard methodology of
modeling and simulating real activity logs of production systems,
which is routinely practiced in system-related research. The
overwhelmingly accepted assumption underlying this methodology is
that such real workloads are representative and reliable. We
show, however, that real workloads may also contain anomalies that
make them non-representative and unreliable. This is a special
case of multi-class workloads, where one class is the “real”
workload which we wish to use in the evaluation, and the other
class contaminates the log with “bogus” data. We provide
several examples of this situation, including an anomaly we call
“workload flurries”: surges of activity with a repetitive
nature, caused by a single user, that dominate the workload for a
relatively short period. Using a workload with such anomalies in
effect emphasizes rare and unique events (e.g. occurring for a
few days out of two years of logged data), and risks optimizing
the design decision for the anomalous workload at the expense of
the normal workload. Thus, we claim that such anomalies should be
removed from the workload before it is used in evaluations, and
that ignoring them is actually an unjustifiable approach.
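To make the third contribution concrete, the following minimal sketch
(ours, not the thesis's code; all names and the no-history fallback are
illustrative assumptions) shows the two ingredients highlighted above:
a predictor that averages the same user's two most recent runtimes, and
a shortest-job-backfilled-first (SJBF) ordering of backfill candidates.

    # Minimal sketch of the predictor and SJBF order described above.
    # Illustrative only; job objects are assumed to expose .user and
    # .estimate, and with no history we fall back to the user estimate.
    from collections import defaultdict, deque

    class TwoJobPredictor:
        def __init__(self):
            # Keep only the two most recent runtimes per user.
            self._recent = defaultdict(lambda: deque(maxlen=2))

        def predict(self, user, user_estimate):
            runs = self._recent[user]
            return sum(runs) / len(runs) if runs else user_estimate

        def record(self, user, runtime):
            self._recent[user].append(runtime)

    def sjbf_candidates(waiting_jobs, predictor):
        # Scan backfill candidates by increasing predicted runtime;
        # FCFS is still preserved for the head of the queue.
        return sorted(waiting_jobs,
                      key=lambda j: predictor.predict(j.user, j.estimate))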
@PhdThesis{tsafrir06-phd,
  author  = {Dan Tsafrir},
  title   = {Modeling, evaluating, and improving the performance of
             supercomputer scheduling},
  school  = {School of Computer Science and Engineering, the Hebrew
             University (PhD Thesis)},
  year    = 2006,
  month   = {September},
  address = {Jerusalem, Israel},
  note    = {Technical Report 2006-78}
}
-
Session-based, estimation-less, and information-less
runtime prediction algorithms for parallel and grid job
scheduling
David Talby, Dan Tsafrir, Zviki Goldberg, Dror
G. Feitelson
Technical Report 2006-77, School of Computer Science and
Engineering, the Hebrew University
August,
2006,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
The default setting of most production parallel job schedulers is
FCFS with backfilling. Under this setting, users must supply job
runtime estimates, which are known to be highly inaccurate and
inferior to system-generated predictions. Recent research revealed
how to utilize system predictions for backfilling, and this
potential performance gain motivates searching for better
prediction techniques. We present three prediction techniques
using decreasing levels of information as is suitable for the
situation at hand. The first is based on "user sessions":
continuous temporal periods of per-user work. This algorithm
exploits the entire long-term historical data of the workload,
along with user runtime estimates. The second is
"estimation-less", that is, uses historical data only, relieving
users from the annoying need to supply estimates. The third is
completely "informationless" and is suitable for cases in which
neither historical information nor estimates are available, as
happens in some grid environments. We evaluate the algorithms by
simulating real data from production systems. We find all of them
to be successful in terms of both accuracy and performance.
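For intuition, a "user session" can be pictured as below: a user's
submissions are split into sessions wherever a long gap occurs. This is
a hedged sketch of ours; the gap threshold and the names are
assumptions, not values fixed by the report.

    # Hypothetical session splitter: continuous periods of per-user
    # work, ended by a gap longer than a threshold (the 20-minute
    # value is our assumption).
    SESSION_GAP_SECONDS = 20 * 60

    def split_into_sessions(submit_times):
        sessions, current = [], []
        for t in sorted(submit_times):
            if current and t - current[-1] > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(t)
        if current:
            sessions.append(current)
        return sessions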
@TechReport{talby06-predalg,
  author      = {David Talby and Dan Tsafrir and Zviki Goldberg and
                 Dror G. Feitelson},
  title       = {Session-based, estimation-less, and information-less
                 runtime prediction algorithms for parallel and grid
                 job scheduling},
  institution = {School of Computer Science and Engineering, the
                 Hebrew University},
  number      = {2006-77},
  year        = 2006,
  month       = {August},
  address     = {Jerusalem, Israel}
}
-
Instability in parallel job scheduling simulation: the role
of workload flurries
Dan Tsafrir, Dror G. Feitelson
IPDPS '06: IEEE International Parallel and Distributed Processing
Symposium
April,
2006,
Rhodes Island, Greece,
page 10
Abstract ,
BibTeX ,
PDF ,
Definitive
The performance of computer systems depends, among other things,
on the workload. This motivates the use of real workloads (as
recorded in activity logs) to drive simulations of new designs.
Unfortunately, real workloads may contain various anomalies that
contaminate the data. A previously unrecognized type of anomaly
is workload flurries: rare surges of activity with a repetitive
nature, caused by a single user, that dominate the workload for a
relatively short period. We find that long workloads often
include at least one such event. We show that in the context of
parallel job scheduling these events can have a significant effect
on performance evaluation results, e.g. a very small perturbation
of the simulation conditions might lead to a large and
disproportional change in the outcome. This instability is due to
jobs in the flurry being affected in unison, a consequence of the
flurry's repetitive nature. We therefore advocate that flurries
be filtered out before the workload is used, in order to achieve
stable and more reliable evaluation results (analogously to the
removal of outliers in statistical analysis). At the same time,
we note that more research is needed on the possible effects of
flurries.
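For intuition only, a flurry detector might look as follows: flag
periods in which a single user dominates the submitted jobs. The weekly
bucketing and the thresholds are our assumptions; the paper does not
prescribe a specific detector here.

    # Illustrative flurry detector (assumed thresholds): flag weeks in
    # which one user accounts for most of the submitted jobs.
    from collections import Counter, defaultdict

    WEEK = 7 * 24 * 3600

    def find_flurries(jobs, min_jobs=1000, min_share=0.8):
        # jobs: iterable of (submit_time_in_seconds, user) pairs.
        per_week = defaultdict(Counter)
        for t, user in jobs:
            per_week[t // WEEK][user] += 1
        flurries = []
        for week, users in sorted(per_week.items()):
            user, count = users.most_common(1)[0]
            if count >= min_jobs and count / sum(users.values()) >= min_share:
                flurries.append((week, user, count))
        return flurries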
@InProceedings{tsafrir06-bfly,
  author    = {Dan Tsafrir and Dror G. Feitelson},
  title     = {Instability in parallel job scheduling simulation: the
               role of workload flurries},
  booktitle = {IEEE International Parallel and Distributed Processing
               Symposium (IPDPS)},
  year      = 2006,
  month     = {April},
  pages     = 10,
  address   = {Rhodes Island, Greece}
}
-
Workload sanitation for performance evaluation
Dror G. Feitelson, Dan Tsafrir
ISPASS '06: IEEE International Symposium on Performance Analysis of
Systems and Software
March,
2006,
Austin, Texas,
pages 221–230
Abstract ,
BibTeX ,
PDF ,
Definitive
The performance of computer systems depends, among other things,
on the workload. Performance evaluations are therefore often done
using logs of workloads on current production systems, under the
assumption that such real workloads are representative and
reliable; likewise, workload modeling is typically based on real
workloads. We show, however, that real workloads may also contain
anomalies that make them non-representative and unreliable. This
is a special case of multi-class workloads, where one class is the
“real” workload which we wish to use in the evaluation, and the
other class contaminates the log with “bogus” data. We provide
several examples of this situation, including a previously
unrecognized type of anomaly we call “workload flurries”: surges
of activity with a repetitive nature, caused by a single user,
that dominate the workload for a relatively short period. Using a
workload with such anomalies in effect emphasizes rare and unique
events (e.g. occurring for a few days out of two years of logged
data), and risks optimizing the design decision for the anomalous
workload at the expense of the normal workload. Thus we claim that
such anomalies should be removed from the workload before it is
used in evaluations, and that ignoring them is actually an
unjustifiable approach.
@InProceedings{feitelson06-clean,
  author    = {Dror G. Feitelson and Dan Tsafrir},
  title     = {Workload sanitation for performance evaluation},
  booktitle = {IEEE International Symposium on Performance Analysis of
               Systems and Software (ISPASS)},
  year      = 2006,
  month     = {March},
  pages     = {221--230},
  address   = {Austin, Texas}
}
-
System and method for backfilling with system-generated
predictions rather than user runtime estimates
Dan Tsafrir, Yoav Etsion, David Talby, Dror G. Feitelson
Patent Number: US 8261283 B2 (granted: Sep 2012)
February,
2006
BibTeX ,
Html
@Misc{tsafrir06-predpat,
  author       = {Dan Tsafrir and Yoav Etsion and David Talby and
                  Dror G. Feitelson},
  title        = {System and method for backfilling with
                  system-generated predictions rather than user
                  runtime estimates},
  howpublished = {Patent Number: US 8261283 B2 (granted: Sep 2012)},
  year         = 2006,
  month        = {February}
}
2005
-
System noise, OS clock ticks, and fine-grained parallel
applications
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson, Scott
Kirkpatrick
ICS '05: ACM International Conference on Supercomputing
June,
2005,
Cambridge, Massachusetts,
pages 303–312
Abstract ,
BibTeX ,
PDF ,
Definitive
As parallel jobs get bigger in size and finer in granularity,
“system noise” is increasingly becoming a problem. In fact,
fine-grained jobs on clusters with thousands of SMP nodes run
faster if a processor is intentionally left idle (per node), thus
enabling a separation of “system noise” from the computation.
Paying a cost in average processing speed at a node for the sake
of eliminating occasional process delays is (unfortunately)
beneficial, as such delays are enormously magnified when one late
process holds up thousands of peers with which it synchronizes.
We provide a probabilistic argument showing that, under certain
conditions, the effect of such noise is linearly proportional to
the size of the cluster (as is often empirically observed). We
then identify a major source of noise to be indirect overhead of
periodic OS clock interrupts (“ticks”), that are used by all
general-purpose OSs as a means of maintaining control. This is
shown for various grain sizes, platforms, tick frequencies, and
OSs. To eliminate such noise, we suggest replacing ticks with an
alternative mechanism we call “smart timers”. This turns out to
also be in line with the needs of desktop and mobile computing,
increasing the chances that the suggested change will be accepted.
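The linear dependence on cluster size mentioned above can be pictured
with a simple independence argument (our back-of-the-envelope sketch,
not the paper's derivation): if each node suffers a tick-induced delay
during a compute phase with small probability p, a barrier stalls
whenever any of the n nodes is delayed, with probability
1-(1-p)^n, which is roughly n*p for small p.

    # Back-of-the-envelope sketch: probability that *some* node delays
    # a barrier, which grows roughly linearly in n for small p.
    def prob_barrier_delayed(n_nodes: int, p: float) -> float:
        return 1.0 - (1.0 - p) ** n_nodes

    for n in (1, 128, 1024, 8192):
        print(f"{n:5d} nodes: {prob_barrier_delayed(n, p=1e-4):.3f}")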
@InProceedings{tsafrir05-noise,
  author    = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson and
               Scott Kirkpatrick},
  title     = {System noise, {OS} clock ticks, and fine-grained
               parallel applications},
  booktitle = {ACM International Conference on Supercomputing (ICS)},
  year      = 2005,
  month     = {June},
  pages     = {303--312},
  address   = {Cambridge, Massachusetts}
}
-
Modeling user runtime estimates
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
JSSPP '05: Workshop on Job Scheduling Strategies for Parallel
Processing
June,
2005,
Cambridge, Massachusetts,
pages 1–35,
Lecture Notes in Computer Science, Volume 3834
Abstract ,
BibTeX ,
PDF ,
Definitive ,
Software
User estimates of job runtimes have emerged as an important
component of the workload on parallel machines, and can have a
significant impact on how a scheduler treats different jobs, and
thus on overall performance. It is therefore highly desirable to
have a good model of the relationship between parallel jobs and
their associated estimates. We construct such a model based on a
detailed analysis of several workload traces. The model
incorporates those features that are consistent in all of the
logs, most notably the inherently modal nature of estimates (e.g.
only 20 different values are used as estimates for about 90% of
the jobs). We find that the behavior of users, as manifested
through the estimate distributions, is remarkably similar across
the different workload traces. Indeed, providing our model with
only the maximal allowed estimate value, along with the percentage
of jobs that have used it, yields results that are very similar to
the original. The remaining difference (if any) is largely
eliminated by providing information on one or two additional
popular estimates. Consequently, in comparison to previous
models, simulations that utilize our model are better in
reproducing scheduling behavior similar to that observed when
using real estimates.
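A toy version of such a modal generator might look as follows; the head
values and probabilities below are placeholders for illustration, not
the paper's fitted parameters.

    # Toy modal estimate generator (placeholder parameters): a handful
    # of "round" estimates covers most jobs, and the maximum allowed
    # estimate is the most popular value.
    import random

    def sample_estimate(max_estimate_sec=36 * 3600):
        head = [
            (max_estimate_sec, 0.25),  # the maximum is most popular
            (60 * 60,          0.15),  # one hour
            (10 * 60,          0.10),  # ten minutes
            (4 * 60 * 60,      0.08),  # four hours
        ]
        r, acc = random.random(), 0.0
        for value, prob in head:
            acc += prob
            if r < acc:
                return value
        # Tail: some other "round" value, drawn here as whole minutes.
        return random.randint(1, max_estimate_sec // 60) * 60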
@InCollection{tsafrir05-est,
  author    = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
  title     = {Modeling user runtime estimates},
  booktitle = {Workshop on Job Scheduling Strategies for Parallel
               Processing (JSSPP)},
  year      = 2005,
  month     = {June},
  pages     = {1--35},
  address   = {Cambridge, Massachusetts},
  note      = {Lecture Notes in Computer Science, Volume 3834}
}
-
A short survey of commercial cluster batch schedulers
Yoav Etsion, Dan Tsafrir
Technical Report 2005-13, School of Computer Science and
Engineering, the Hebrew University
May,
2005,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
As high performance computing clusters are getting cheaper, they
become more accessible. The various clusters are running a host of
workload management software suites, which are getting more
complex and offer cluster administrators numerous features,
scheduling policies, job prioritization schemes, etc. In this
paper we survey some of the common commercial workload managers on
the market, covering their main features, specifically the
scheduling policies and algorithms they support and their priority
and queueing mechanisms, with a focus on their default settings.
@TechReport{etsion05-survey,
  author      = {Yoav Etsion and Dan Tsafrir},
  title       = {A short survey of commercial cluster batch schedulers},
  institution = {School of Computer Science and Engineering, the
                 Hebrew University},
  number      = {2005-13},
  year        = 2005,
  month       = {May},
  address     = {Jerusalem, Israel}
}
-
General purpose timing: the failure of periodic timers
Dan Tsafrir, Yoav Etsion, Dror G. Feitelson
Technical Report 2005-6, School of Computer Science and
Engineering, the Hebrew University
February,
2005,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
All general-purpose commodity operating systems use periodic clock
interrupts to regain control and measure the passage of time. This
is ill-suited for desktop settings, as the fine-grained timing
requirements of modern multimedia applications require a high
clock rate, which may suffer from significant overhead. It is
ill-suited for HPC environments, as asynchronous interrupts ruin
the coordination among cluster nodes. And it is ill-suited for
mobile platforms, as it wastes significant energy, especially when
the system is otherwise idle. To be truly general-purpose, systems
should therefore switch to a mechanism that is closer to one-shot
timers (set only for specific needs) while avoiding the potentially
huge overhead they entail. With a careful design it is possible to
achieve both high accuracy and low overhead, thus significantly
extending the applicability of general-purpose operating systems.
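The one-shot alternative can be sketched as an event heap plus a
hardware timer that is always armed for the earliest pending deadline
instead of firing periodically. The interface below is our invention
for illustration; it is not the report's design.

    # Illustrative one-shot timer core: re-arm the hardware timer for
    # the earliest pending event only, instead of a fixed period.
    import heapq
    import itertools

    class OneShotTimers:
        def __init__(self, arm_hardware_timer):
            self._events = []              # (deadline, seq, callback)
            self._seq = itertools.count()  # tie-breaker for equal deadlines
            self._arm = arm_hardware_timer

        def add(self, deadline, callback):
            heapq.heappush(self._events, (deadline, next(self._seq), callback))
            self._arm(self._events[0][0])  # earliest deadline wins

        def on_interrupt(self, now):
            # Fire everything that is due, then re-arm once.
            while self._events and self._events[0][0] <= now:
                _, _, callback = heapq.heappop(self._events)
                callback()
            if self._events:
                self._arm(self._events[0][0])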
@TechReport{tsafrir05-timing,
  author      = {Dan Tsafrir and Yoav Etsion and Dror G. Feitelson},
  title       = {General purpose timing: the failure of periodic timers},
  institution = {School of Computer Science and Engineering, the
                 Hebrew University},
  number      = {2005-6},
  year        = 2005,
  month       = {February},
  address     = {Jerusalem, Israel}
}
2004
-
Desktop scheduling: how can we know what the user wants?
Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
NOSSDAV '04: ACM International Workshop on Network and Operating Systems
Support for Digital Audio and Video
June,
2004,
Kinsale, Ireland,
pages 110–115
Abstract ,
BibTeX ,
PDF ,
Definitive
Current desktop operating systems use CPU utilization (or lack
thereof) to prioritize processes for scheduling. This was thought
to be beneficial for interactive processes, under the assumption
that they spend much of their time waiting for user input. This
reasoning fails for modern multimedia applications. For example,
playing a movie in parallel with a heavy background job usually
leads to poor graphical results, as these jobs are
indistinguishable in terms of CPU usage. Suggested solutions
involve shifting the burden to the user or programmer, which we
claim is unsatisfactory; instead, we seek an automatic
solution. Our attempts using new metrics based on CPU usage
failed. We therefore propose and implement a novel scheme of
identifying interactive and multimedia applications by directly
quantifying the I/O between an application and the user (keyboard,
mouse, and screen activity). Preliminary results indicate that
prioritizing processes according to this metric indeed solves the
aforementioned problem, demonstrating that operating systems can
indeed provide better support for multimedia and interactive
applications. Additionally, once user I/O data is available, it
opens intriguing new possibilities to system designers.
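One way to picture the metric: accumulate recent keyboard, mouse, and
screen traffic per process, decayed over time so that only recent
interaction counts. The weights and the decay factor below are
illustrative assumptions of ours, not the paper's calibration.

    # Illustrative per-process user-I/O score with exponential decay
    # (weights and decay are assumptions): higher score, higher priority.
    class UserIoMonitor:
        def __init__(self, decay=0.9):
            self.decay = decay
            self.score = 0.0

        def tick(self, kbd_events, mouse_events, pixels_drawn):
            sample = (1.0 * kbd_events +
                      0.5 * mouse_events +
                      1e-6 * pixels_drawn)
            self.score = self.decay * self.score + sample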
@InProceedings{etsion04-huc-pri,
  author    = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
  title     = {Desktop scheduling: how can we know what the user wants?},
  booktitle = {ACM International Workshop on Network and Operating
               Systems Support for Digital Audio and Video (NOSSDAV)},
  year      = 2004,
  month     = {June},
  pages     = {110--115},
  address   = {Kinsale, Ireland}
}
2003
-
Workload flurries
Dan Tsafrir, Dror G. Feitelson
Technical Report 2003-85, School of Computer Science and
Engineering, the Hebrew University
November,
2003,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF
The performance of computer systems depends, among other things,
on the workload. Performance evaluations are therefore often done
using logs of workloads on current production systems, under the
assumption that such real workloads are representative and
reliable; likewise, workload modeling is typically based on real
workloads. However, real workloads may also contain anomalies that
make them non-representative and unreliable. A previously
unrecognized type of anomaly is workload flurries: surges of
activity with a repetitive nature, caused by a single user, that
dominate the workload for a relatively short period. Under
suitable conditions, such flurries can have a decisive effect on
performance evaluation results. The problem is that workloads
containing such a flurry are not representative of normal
usage. Moreover, creating a statistical model based on such a
workload or using it directly is also not representative of flurries
in general. This motivates the approach of identifying and
removing the flurries, so as to allow for an evaluation under normal
conditions. We demonstrate this for several evaluations of
parallel systems, showing that the anomalies in the workload as
embodied by flurries carry over to anomalies in the evaluation
results, which disappear when the flurries are removed. Such an
evaluation can then be augmented by a separate evaluation of the
deviation caused by the flurry.
@TechReport{tsafrir03-flurry,
  author      = {Dan Tsafrir and Dror G. Feitelson},
  title       = {Workload flurries},
  institution = {School of Computer Science and Engineering, the
                 Hebrew University},
  number      = {2003-85},
  year        = 2003,
  month       = {November},
  address     = {Jerusalem, Israel}
}
-
Effects of clock resolution on the scheduling of
interactive and soft real-time processes
Yoav Etsion, Dan Tsafrir, Dror G. Feitelson
SIGMETRICS '03: ACM SIGMETRICS International Conference on Measurement and
Modeling
June,
2003,
San Diego, California,
pages 172–183
Abstract ,
BibTeX ,
PDF ,
Definitive
It is commonly agreed that scheduling mechanisms in general
purpose operating systems do not provide adequate support for
modern interactive applications, notably multimedia
applications. The common solution to this problem is to devise
specialized scheduling mechanisms that take the specific needs of
such applications into account. A much simpler alternative is to
better tune existing systems. In particular, we show that
conventional scheduling algorithms typically have only little, and
possibly misleading, information regarding the CPU usage of
processes, because increasing CPU rates have caused the common 100
Hz clock interrupt rate to be coarser than most application time
quanta. We therefore conduct an experimental analysis of what
happens if this rate is significantly increased. Results indicate
that much higher clock interrupt rates are possible with
acceptable overheads, and lead to much better information. In
addition we show that increasing the clock rate can provide a
measure of support for soft real time requirements, even when
using a general-purpose operating system. For example, we achieve
a sub-millisecond latency under heavily loaded conditions.
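The measurement problem is easy to reproduce with synthetic numbers
(ours, not the paper's data): a job that runs in 2 ms bursts every
10 ms really uses 20% of the CPU, yet with 10 ms ticks it can be billed
nothing at all, depending on how its bursts align with tick boundaries.

    # Tick-based CPU accounting error, with synthetic numbers: bursts
    # are billed one whole tick per tick boundary that lands inside them.
    import math

    def measured_usage(burst_ms, period_ms, tick_ms, duration_ms=10_000):
        billed = actual = 0.0
        t = 0.0
        while t < duration_ms:
            actual += burst_ms
            ticks_inside = (math.floor((t + burst_ms) / tick_ms)
                            - math.floor(t / tick_ms))
            billed += ticks_inside * tick_ms
            t += period_ms
        return actual / duration_ms, billed / duration_ms

    print(measured_usage(2, 10, tick_ms=10))  # 100 Hz: billed 0%
    print(measured_usage(2, 10, tick_ms=1))   # 1000 Hz: billed ~20%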
@InProceedings{etsion03-clock,
  author    = {Yoav Etsion and Dan Tsafrir and Dror G. Feitelson},
  title     = {Effects of clock resolution on the scheduling of
               interactive and soft real-time processes},
  booktitle = {ACM SIGMETRICS International Conference on Measurement
               and Modeling (SIGMETRICS)},
  year      = 2003,
  month     = {June},
  pages     = {172--183},
  address   = {San Diego, California}
}
2002
-
Barrier synchronization on a loaded SMP using two-phase
waiting algorithms
Dan Tsafrir, Dror G. Feitelson
IPDPS '02: IEEE International Parallel and Distributed Processing
Symposium
April,
2002,
Fort Lauderdale, Florida,
page 80
Abstract ,
BibTeX ,
PDF ,
Definitive
Little work has been done on the performance of barrier
synchronization using two-phase blocking, as the common wisdom is
that it is useless to spin if the total number of threads in the
system exceeds the number of processors. We challenge this and
show that it may be beneficial to spin-wait even if the number of
threads is up to double the number of processors, especially if
the waiting time is at least twice the context switch overhead
(rather than being equal to it). We also characterize the
alternating synchronization pattern that applications based on
barriers tend to fall into, which is quite different from the
patterns typically assumed in theoretical analyses.
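Two-phase waiting itself is simple to sketch: spin for a bounded
period, then block. The abstract's point concerns the choice of the
spin threshold relative to the context-switch overhead; the snippet
below uses a threading.Event purely for illustration.

    # Two-phase waiting sketch: spin briefly, then block. Per the
    # abstract, spinning for about twice the context-switch overhead
    # can pay off even on a loaded system.
    import time
    import threading

    def two_phase_wait(event: threading.Event, spin_seconds: float):
        deadline = time.monotonic() + spin_seconds
        while time.monotonic() < deadline:  # phase 1: spin
            if event.is_set():
                return
        event.wait()                        # phase 2: block

    # e.g., with a ~5 microsecond context switch:
    # two_phase_wait(barrier_released, spin_seconds=2 * 5e-6)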
@InProceedings{tsafrir02-sync,
  author    = {Dan Tsafrir and Dror G. Feitelson},
  title     = {Barrier synchronization on a loaded {SMP} using
               two-phase waiting algorithms},
  booktitle = {IEEE International Parallel and Distributed Processing
               Symposium (IPDPS)},
  year      = 2002,
  month     = {April},
  pages     = 80,
  address   = {Fort Lauderdale, Florida}
}
2001
-
Barrier synchronization on a loaded SMP using two-phase
waiting algorithms
Dan Tsafrir
MSc: Technical Report 2001-82, School of Computer Science and
Engineering, the Hebrew University
September,
2001,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
Definitive
Little work has been done on the performance of barrier
synchronization using two-phase blocking, as the common wisdom is that
it is useless to spin if the total number of threads in the system
exceeds the number of processors.
We challenge this view and show that it may be beneficial to spin-wait
if the spinning period is set to be a bit more than twice the context
switch overhead (rather than being equal to it).
We show that the success of our approach is due to an inherent
property of general-purpose schedulers, which tend to select
threads that become unblocked for immediate execution.
We find that this property causes applications based on barriers to
fall into a previously unnoticed pattern, denoted “alternating
synchronization”, which is quite different from the patterns
typically assumed in theoretical analyses.
By merely choosing an appropriate spinning period, we leverage
alternating synchronization to implicitly nudge the system
into simultaneously co-scheduling the application's threads, thereby
dramatically reducing the overhead of synchronization and
significantly improving the performance.
@MastersThesis{tsafrir01-msc,
  author  = {Dan Tsafrir},
  title   = {Barrier synchronization on a loaded {SMP} using two-phase
             waiting algorithms},
  school  = {School of Computer Science and Engineering, the Hebrew
             University (MSc Thesis)},
  year    = 2001,
  month   = {September},
  address = {Jerusalem, Israel},
  note    = {Technical Report 2001-82}
}