Skip to content Skip to navigation




Millisort: a distributed sorting for millisecond-scale time budget [slide]
MilliSort is a new sorting benchmark for massively parallel computing. Its challenge is as follows: “Given an unlimited number of servers in a datacenter, how many small items can be sorted in one millisecond?” We invented a new distributed sorting algorithm which can harness thousands of parallel server nodes for a few milliseonds. The new sorting algorithm minimizes the cost of coordination among nodes and makes the massive parallelization possible. To extrapolate the performance of the new algorithm to very large cluster sizes, we developed a cost simulator. According to the simulator, given a 1 ms time budget, we can sort about 3.7 million 100B tuples by harnessing 330 nodes. Given a 10 ms time budget, we can sort 300 million tuples by harnessing 3010 nodes.

CURP: Consistent Unordered Replication Protocol [paper][video] [Implementation: RAMCloud, Redis-client, Redis-server, Raft]
Traditional approaches to replication require client requests to be ordered before making them durable by copying them to replicas. As a result, clients must wait for two round-trip times (RTTs) before updates complete. On the other hand, Consistent Unordered Replication Protocol (CURP) allows clients to replicate requests that have not yet been ordered, as long as they are commutative. This strategy allows most operations to complete in 1 RTT (the same as an unreplicated system). I implemented CURP in the Redis and RAMCloud storage systems. In RAMCloud, CURP improved write latency by 2x and write throughput by 4x.

NanoLog: A Nanosecond Scale Logging System [paper]
NanoLog is a nanosecond scale logging system that is 1-2 orders of magnitude faster than existing logging systems such as Log4j2, spdlog, Boost log or Event Tracing for Windows. The system achieves a throughput up to 80 million log messages per second for simple messages and has a typical log invocation overhead of 8-18 nanoseconds, despite exposing a traditional printf-like API. NanoLog slims down user log messages at compile-time by extracting static log components, outputs the log in a compacted, binary format at runtime, and utilizes an offline process to re-inflate the compacted logs. The lower cost of NanoLog allows developers to log more often, log in more detail, and use logging in low-latency production settings where traditional logging mechanisms are too expensive.

RIFL: Reusable Infrastructure for Linearizability [paper] [video]
Linearizability is the strongest form of consistency for concurrent systems, but most large-scale storage systems settle for weaker forms of consistency. RIFL provides a general-purpose mechanism for converting at-least-once RPC semantics to exactly-once semantics, thereby making it easy to turn non-linearizable operations into linearizable ones. RIFL is designed for large-scale systems and is lightweight enough to be used in low-latency environments. RIFL handles data migration by associating linearizability metadata with objects in the underlying store and migrating metadata with the corresponding objects. It uses a lease mechanism to implement garbage collection for metadata. We have implemented RIFL in the RAMCloud storage system and used it to make basic operations such as writes and atomic increments linearizable; RIFL adds only 530 ns to the 13.5 us base latency for durable writes. This project is integrated into the main branch of RAMCloud [gitHub].

Distributed Transactions in RAMCloud [paper]
Constructed a new distributed transaction mechanism, with which clients can commit a transaction in 1 RTT. Clients send PREPARE requests to servers, and servers respond with votes. The clients can complete the transaction after collecting COMMIT-VOTEs from all participating servers. RIFL's exactly-once mechanism is used to simplify desgin and implementation of the multi-object transaction mechanism; the votes (which is the responses of PREPARE RPCs) are made durable and managed automatically by RIFL. The transaction mechanism can commit simple distributed transactions in about 20 us and it outperforms the H-Store main-memory database system for the TPC-C benchmark. This project is integrated into the main branch of RAMCloud [gitHub].

RAMCloud [website]
RAMCloud is a low-latency large-scale distributed key-value storage system. It was designed to provide lowest possible latency while maintaining strong consistency. Working on RAMCloud motivated my other projects, such as CURP, RIFL, and RIFL-based transaction mechanism, which provide the lowest possible latency while keeping strong consistency.



Exploiting Commutativity For Practical Fast Replication
Seo Jin Park and John Ousterhout
16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19), February 2019

NanoLog: A Nanosecond Scale Logging System
Stephen Yang, Seo Jin Park and John Ousterhout
2018 USENIX Annual Technical Conference (ATC’18), July 2018

Implementing Linearizability at Large Scale and Low Latency
Collin Lee*, Seo Jin Park*, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout (*co-first author)
The 25th ACM Symposium on Operating Systems Principles (SOSP'15), October 2015

The RAMCloud Storage System
John Ousterhout, Arjun Gopalan, Ashish Gupta, Ankita Kejriwal, Collin Lee, Behnam Montazeri, Diego Ongaro, Seo Jin Park, Henry Qin, Mendel Rosenblum, Stephen Rumble, Ryan Stutsman, and Stephen Yang
ACM Transactions on Computer Systems (TOCS) 33, 3, Article 7, August 2015

Analyzing performance and usability of broadcast-based inter-core communication (ATAC) on manycore architecture
Master's thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, Jun 2013




Consistency in Asynchronous Replication

Linearizability and Transactions