Implementing Raft: Part 3 – Persistence and Optimizations

This is Part 3 in a series of posts describing the Raft distributed consensus
algorithm and its complete implementation in Go. The earlier posts in the
series introduced the algorithm and built up the basic implementation.

In this part we’ll complete our basic implementation of Raft, by adding
persistence and some optimizations. All the code for this part is located in
this directory.


The goal of a consensus algorithm like Raft is to create a system that has
higher availability than its parts by replicating a task across isolated
servers. So far, we’ve been focusing on the fault scenario of network
partitions, where some servers in the cluster become disconnected from others
(or from clients). Another mode of failure is crashes, wherein a server stops
working and restarts.

To other servers, a crash looks much like a network partition – the server
simply became temporarily disconnected. For the crashed server itself, however,
the situation is quite different: upon restarting, all of its volatile
in-memory state is lost.

Precisely for this reason, Figure 2 in the Raft paper clearly marks which state
should be persistent; persistent state is written and flushed to nonvolatile
storage every time it’s updated. Whatever state has to be persisted by a server
is persisted before the server issues the next RPC or replies to an ongoing RPC.

Raft can get by with persisting only a subset of its state, namely:

  • currentTerm – the latest term this server has observed
  • votedFor – the peer ID for whom this server voted in the latest term
  • log – Raft log entries

Q: Why are commitIndex and lastApplied volatile?

A: commitIndex is volatile because Raft can figure out a correct value
for it after a reboot using just the persistent state. Once a leader
successfully gets a new log entry committed, it knows everything before that
point is also committed. A follower that crashes and comes back up will be told
about the right commitIndex whenever the current leader sends it an AppendEntries (AE) RPC.

lastApplied starts at zero after a reboot because the basic Raft algorithm
assumes the service (e.g., a key/value database) doesn’t keep any persistent
state. Thus its state needs to be completely recreated by replaying all log
entries. This is rather inefficient, of course, so many optimization ideas
are possible. Raft supports snapshotting the log when it grows large; this
is described in section 6 of the Raft paper, and is out of scope for this series
of posts.
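
To make the replay idea concrete, here is a minimal sketch assuming the service is a simple key/value store; the kvCmd and replayLog names are illustrative and not part of the post’s code:

```go
package main

// LogEntry mirrors the shape of the entries in the Raft log.
type LogEntry struct {
	Command interface{}
	Term    int
}

// kvCmd is a hypothetical command type for a key/value service.
type kvCmd struct{ Key, Value string }

// replayLog rebuilds the key/value state from scratch by applying every log
// entry in order, as a restarted service must do in basic Raft.
func replayLog(entries []LogEntry) map[string]string {
	state := make(map[string]string)
	for _, e := range entries {
		if c, ok := e.Command.(kvCmd); ok {
			state[c.Key] = c.Value
		}
	}
	return state
}
```

Note that replay cost grows linearly with log length, which is exactly why snapshotting becomes attractive for long-lived clusters.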

Command delivery semantics

In Raft, depending on circumstances, a command may be delivered to a client more
than once. There are several scenarios in which this can happen, including
crashes and restarts (when the log is replayed).

In terms of message delivery semantics, Raft is in the at-least-once camp.
Once a command is submitted, it will eventually be delivered to all clients,
but some clients may see the same command more than once. Therefore, it’s
recommended that commands carry unique IDs and clients ignore commands that
were already delivered. This is described in a bit more detail in section 8
of the Raft paper.
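
A minimal sketch of such client-side deduplication might look like this; the Command and DedupApplier names are illustrative, and this mechanism is not part of the post’s code:

```go
package main

// Command carries a unique ID so that replays can be detected.
type Command struct {
	ID      int64
	Payload string
}

// DedupApplier remembers which command IDs were already applied.
type DedupApplier struct {
	applied map[int64]bool
}

func NewDedupApplier() *DedupApplier {
	return &DedupApplier{applied: make(map[int64]bool)}
}

// Apply returns true if the command was applied now, or false if this ID was
// already delivered once and the replay should be ignored.
func (d *DedupApplier) Apply(c Command) bool {
	if d.applied[c.ID] {
		return false
	}
	d.applied[c.ID] = true
	// ... apply c.Payload to the client's state machine here ...
	return true
}
```

A production version would also bound the size of the applied set, e.g. by tracking the highest contiguous ID per client, as section 8 of the paper hints.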

Storage interface

To implement persistence, we’re adding the following interface to the code:

type Storage interface {
  Set(key string, value []byte)

  Get(key string) ([]byte, bool)

  // HasData returns true iff any Sets were made on this Storage.
  HasData() bool
}

You can think of it as a map from string to a generic byte slice, backed by a
persistent store.
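
For testing, one simple implementation is an in-memory map guarded by a mutex; a production implementation would serialize to disk and flush on every Set. The MapStorage name here is illustrative:

```go
package main

import "sync"

// MapStorage is an in-memory implementation of the Storage interface,
// suitable for tests. It is safe for concurrent use.
type MapStorage struct {
	mu sync.Mutex
	m  map[string][]byte
}

func NewMapStorage() *MapStorage {
	return &MapStorage{m: make(map[string][]byte)}
}

func (ms *MapStorage) Set(key string, value []byte) {
	ms.mu.Lock()
	defer ms.mu.Unlock()
	ms.m[key] = value
}

func (ms *MapStorage) Get(key string) ([]byte, bool) {
	ms.mu.Lock()
	defer ms.mu.Unlock()
	v, found := ms.m[key]
	return v, found
}

// HasData returns true iff any Sets were made on this Storage.
func (ms *MapStorage) HasData() bool {
	ms.mu.Lock()
	defer ms.mu.Unlock()
	return len(ms.m) > 0
}
```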

Restoring and saving state

The CM constructor will now take a Storage as an argument and call:

if cm.storage.HasData() {
  cm.restoreFromStorage(cm.storage)
}

The restoreFromStorage method is also new. It loads the persisted state
variables from storage, deserializing them with the standard encoding/gob
package:
func (cm *ConsensusModule) restoreFromStorage(storage Storage) {
  if termData, found := cm.storage.Get("currentTerm"); found {
    d := gob.NewDecoder(bytes.NewBuffer(termData))
    if err := d.Decode(&cm.currentTerm); err != nil {
      log.Fatal(err)
    }
  } else {
    log.Fatal("currentTerm not found in storage")
  }
  if votedData, found := cm.storage.Get("votedFor"); found {
    d := gob.NewDecoder(bytes.NewBuffer(votedData))
    if err := d.Decode(&cm.votedFor); err != nil {
      log.Fatal(err)
    }
  } else {
    log.Fatal("votedFor not found in storage")
  }
  if logData, found := cm.storage.Get("log"); found {
    d := gob.NewDecoder(bytes.NewBuffer(logData))
    if err := d.Decode(&cm.log); err != nil {
      log.Fatal(err)
    }
  } else {
    log.Fatal("log not found in storage")
  }
}

The mirror method is persistToStorage – it encodes and saves all these
state variables to the provided Storage:

func (cm *ConsensusModule) persistToStorage() {
  var termData bytes.Buffer
  if err := gob.NewEncoder(&termData).Encode(cm.currentTerm); err != nil {
    log.Fatal(err)
  }
  cm.storage.Set("currentTerm", termData.Bytes())

  var votedData bytes.Buffer
  if err := gob.NewEncoder(&votedData).Encode(cm.votedFor); err != nil {
    log.Fatal(err)
  }
  cm.storage.Set("votedFor", votedData.Bytes())

  var logData bytes.Buffer
  if err := gob.NewEncoder(&logData).Encode(cm.log); err != nil {
    log.Fatal(err)
  }
  cm.storage.Set("log", logData.Bytes())
}

We implement persistence simply by calling persistToStorage at every point
where these state variables change. If you look at the diff between the
CM’s code in Part 2 and this part, you’ll see these calls sprinkled in a
handful of places.
Naturally, this is not the most efficient way to do persistence, but it’s simple
and it works, so it’s enough for our needs here. The largest inefficiency is
saving the whole log, which can be large in real applications. To really address
this, Raft has a log compaction mechanism which is described in section 7 of
the paper. We’re not going to implement compaction, but feel free to add it to
our implementation as an exercise.

Crash resiliency

With persistence implemented, our Raft cluster becomes somewhat resilient to
crashes. As long as a minority of peers in a cluster crashes and restarts at
some later point, the cluster will remain available to clients (perhaps after a
short delay where a new leader is elected, in case the leader was one of the
crashed peers). As a reminder, a Raft cluster with 2N+1 servers will tolerate
N failed servers and will remain available as long as the other N+1 servers
remain connected to each other.

If you look at the tests for this part, you’ll notice that many new tests were
added. Crash resiliency makes it possible to test a much larger portfolio of
contrived scenarios which are also described in the paper to some degree. It’s
highly recommended to run a couple of crash tests and observe what’s happening.

Unreliable RPC delivery

Since we’re ramping up testing in this part, another aspect of resiliency I’d
like to bring to your attention is unreliable RPC delivery. So far we’ve assumed
that RPCs sent between connected servers will arrive at their destination,
perhaps with a small delay. If you look in server.go, you’ll notice it uses
a type called RPCProxy to implement these delays, among other things. Each
RPC is delayed by 1-5 ms to simulate the real world for peers located in the
same data center.

Another thing RPCProxy lets us implement is optional unreliable delivery.
With the RAFT_UNRELIABLE_RPC env var on, once in a while RPCs will be
delayed significantly (by 75 ms), or dropped altogether. This simulates
real-world network glitches.

We can rerun all our tests with RAFT_UNRELIABLE_RPC on and observe how
the Raft cluster behaves in the presence of these faults – another highly
recommended exercise. If you’re feeling extra motivated, adjust RPCProxy
to not only delay RPC requests, but also RPC replies. This should require just
a handful of additional lines of code.

Optimizing sending AppendEntries

The current leader implementation has a major inefficiency, as I’ve briefly
noted in Part 2.
Leaders send AEs in leaderSendHeartbeats, which is invoked by a ticking
timer every 50 ms. Suppose a new command is submitted; instead of notifying
followers about it immediately, the leader will wait until the next 50 ms
boundary. It gets even worse, because two AE round-trips are needed to notify
followers that a command is committed. Here’s a diagram showing how it works
right now:

Timing diagram with AE on 50 ms boundaries

At time (1) the leader sends a heartbeat AE to a follower, and gets a response
back within a few ms. A new command is submitted, say, 35 ms later. The leader
waits until (2) the next 50 ms boundary to send the updated log to the follower.
The follower replies that the command was added to the log successfully (3). At
this point the leader has advanced its commit index (assuming it got a majority)
and can immediately notify followers, but it waits until the next 50 ms
boundary (4) to do so. Finally, when the follower receives the updated
leaderCommit it can notify its own client about a new committed command.

Much of the time that passes between Submit(X) at the leader and commitChan <- X
at the follower is an unnecessary artifact of our implementation.

What we really want is for the sequence to look like this, instead:

Timing diagram with AEs sent immediately after Submit

This is exactly what the code for this part does. Let’s see the new
parts of the implementation, starting with startLeader. As usual, the lines
that are different from the previous part are highlighted:

func (cm *ConsensusModule) startLeader() {
  cm.state = Leader

  for _, peerId := range cm.peerIds {
    cm.nextIndex[peerId] = len(cm.log)
    cm.matchIndex[peerId] = -1
  }
  cm.dlog("becomes Leader; term=%d, nextIndex=%v, matchIndex=%v; log=%v", cm.currentTerm, cm.nextIndex, cm.matchIndex, cm.log)

  // This goroutine runs in the background and sends AEs to peers:
  // * Whenever something is sent on triggerAEChan
  // * ... Or every 50 ms, if no events occur on triggerAEChan
  go func(heartbeatTimeout time.Duration) {
    // Immediately send AEs to peers.
    cm.leaderSendAEs()

    t := time.NewTimer(heartbeatTimeout)
    defer t.Stop()
    for {
      doSend := false
      select {
      case <-t.C:
        doSend = true

        // Reset timer to fire again after heartbeatTimeout.
        t.Stop()
        t.Reset(heartbeatTimeout)
      case _, ok := <-cm.triggerAEChan:
        if ok {
          doSend = true
        } else {
          return
        }

        // Reset timer for heartbeatTimeout.
        if !t.Stop() {
          <-t.C
        }
        t.Reset(heartbeatTimeout)
      }

      if doSend {
        cm.mu.Lock()
        if cm.state != Leader {
          cm.mu.Unlock()
          return
        }
        cm.mu.Unlock()
        cm.leaderSendAEs()
      }
    }
  }(50 * time.Millisecond)
}

Instead of just waiting for a 50 ms ticker, the loop in startLeader is
waiting on one of two possible events:

  • A send on cm.triggerAEChan
  • A timer counting 50 ms

We’ll see what triggers cm.triggerAEChan soon. This is the signal that an
AE should be sent now. The timer resets whenever the channel is triggered,
implementing the heartbeat logic – if the leader has nothing new to report, it
will wait at most 50 ms.

Note also that the method that actually sends the AEs is renamed from
leaderSendHeartbeats to leaderSendAEs, to better reflect its purpose in
the new code.

One of the methods that triggers cm.triggerAEChan is, as we’d expect,
Submit:
func (cm *ConsensusModule) Submit(command interface{}) bool {
  cm.mu.Lock()
  cm.dlog("Submit received by %v: %v", cm.state, command)
  if cm.state == Leader {
    cm.log = append(cm.log, LogEntry{Command: command, Term: cm.currentTerm})
    cm.persistToStorage()
    cm.dlog("... log=%v", cm.log)
    cm.mu.Unlock()
    cm.triggerAEChan <- struct{}{}
    return true
  }

  cm.mu.Unlock()
  return false
}

The changes are:

  • Whenever a new command is submitted, cm.persistToStorage is called to
    persist the new log entry. This is not related to the heartbeat optimization,
    but I point it out here anyway because it wasn’t done in Part 2 and was
    described earlier in this post.
  • An empty struct is sent on cm.triggerAEChan.
    This will notify the loop in the leader goroutine.
  • The lock handling is reordered a bit; we don’t want to hold the lock while
    sending on cm.triggerAEChan since this can cause a deadlock in some cases.

Can you guess the other place in the code where cm.triggerAEChan is
notified?

It’s in the code that handles AE replies in the leader and advances the commit
index. I won’t reproduce the whole method here, only the small part of the
code that changes:

  if cm.commitIndex != savedCommitIndex {
    cm.dlog("leader sets commitIndex := %d", cm.commitIndex)
    // Commit index changed: the leader considers new entries to be
    // committed. Send new entries on the commit channel to this
    // leader's clients, and notify followers by sending them AEs.
    cm.newCommitReadyChan <- struct{}{}
    cm.triggerAEChan <- struct{}{}
  }

This is a significant optimization that makes our implementation react to new
commands much faster than before.

Batching command submission

The code in the previous section may have left you feeling a bit uncomfortable.
There’s a lot of activity now being triggered by each call to Submit – the
leader immediately broadcasts RPCs to all followers. What happens if we want to
submit multiple commands at once? The network connecting the Raft cluster will
likely get flooded by RPCs.

While it may seem inefficient, it’s actually safe. Raft RPCs are all
idempotent, meaning that getting an RPC with essentially the same information
multiple times does no harm.

If you’re worried about the network traffic in the presence of frequent submits
of many commands at once, batching should be easy to implement. The simplest
way to do this is to provide a way to pass a whole slice of commands into
Submit. Very little code in the Raft implementation has to change as a
result, and the client will be able to submit a whole group of commands without
incurring too much RPC traffic. Try it as an exercise!
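
A sketch of such a batched Submit might look like this; miniCM is a stripped-down, hypothetical stand-in for the post’s ConsensusModule, and SubmitBatch is not part of the post’s code:

```go
package main

import "sync"

// LogEntry mirrors the shape of the entries in the Raft log.
type LogEntry struct {
	Command interface{}
	Term    int
}

// miniCM is a simplified stand-in for the ConsensusModule, with just the
// fields the batching sketch needs.
type miniCM struct {
	mu            sync.Mutex
	isLeader      bool
	currentTerm   int
	log           []LogEntry
	triggerAEChan chan struct{}
}

// SubmitBatch appends a whole slice of commands under a single lock
// acquisition and triggers a single AE round for all of them.
func (cm *miniCM) SubmitBatch(commands []interface{}) bool {
	cm.mu.Lock()
	if !cm.isLeader {
		cm.mu.Unlock()
		return false
	}
	for _, c := range commands {
		cm.log = append(cm.log, LogEntry{Command: c, Term: cm.currentTerm})
	}
	// persistToStorage would be called here, once for the whole batch.
	cm.mu.Unlock()
	// One notification covers the entire batch, so followers receive a
	// single AE instead of one per command.
	cm.triggerAEChan <- struct{}{}
	return true
}
```

The key point is that persistence and the AE notification happen once per batch rather than once per command, which is where the traffic savings come from.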

Optimizing AppendEntries conflict resolution

Another optimization I’d like to discuss in this post is for reducing the number
of rejected AEs required for a leader to bring a follower up-to-date in some
scenarios. Recall that the nextIndex mechanism begins at the very end of
the log and decrements by one each time a follower rejects an AE. In rare cases
the follower can be severely out of date, and the process to update it will
take a long time because each RPC round-trip only advances by one entry.

The paper describes this optimization at the very end of section 5.3, but
doesn’t provide many details about implementing it. To implement it, we’ll
extend the AE reply message with new fields:

type AppendEntriesReply struct {
  Term    int
  Success bool

  // Faster conflict resolution optimization (described near the end of section
  // 5.3 in the paper.)
  ConflictIndex int
  ConflictTerm  int
}

You can see the additional changes in the code for this part. Two places have to
change:
  • AppendEntries is the AE RPC handler; when followers reject an AE, they
    fill in ConflictIndex and ConflictTerm.
  • leaderSendAEs is updated at the point where it receives these AE replies,
    and uses ConflictIndex and ConflictTerm to backtrack nextIndex
    more efficiently.
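
The leader-side backtracking rule can be sketched as a standalone function, following the hint at the end of section 5.3 of the paper; the entry type and function name here are simplified stand-ins for the implementation’s types:

```go
package main

// entry stands in for a Raft log entry; only the term matters here.
type entry struct {
	Term int
}

// nextIndexAfterConflict computes the new nextIndex for a follower that
// rejected an AE. If the leader's log contains entries with ConflictTerm,
// nextIndex moves just past the last such entry; otherwise it falls back to
// the follower-supplied ConflictIndex.
func nextIndexAfterConflict(leaderLog []entry, conflictIndex, conflictTerm int) int {
	if conflictTerm >= 0 {
		lastIndexOfTerm := -1
		for i := len(leaderLog) - 1; i >= 0; i-- {
			if leaderLog[i].Term == conflictTerm {
				lastIndexOfTerm = i
				break
			}
		}
		if lastIndexOfTerm >= 0 {
			return lastIndexOfTerm + 1
		}
	}
	return conflictIndex
}
```

Either way, one rejected AE can skip a whole term’s worth of entries instead of backing up by one.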

The Raft paper says:

In practice, we doubt this optimization is necessary, since failures happen
infrequently and it is unlikely that there will be many inconsistent entries.

And I absolutely agree. To be able to test this optimization, I had to come up
with a rather contrived test; IMHO the chances of such scenarios happening in
real life are very low, and the one-time gain of a couple hundred milliseconds
doesn’t warrant the code complication. I’m showing it here just as an example
of the many optimizations that can be applied to the uncommon cases in Raft;
in terms of coding, it’s a neat example of how the Raft algorithm can be
slightly modified to change its behavior in some corner cases.

Raft was designed to make the common case fairly fast, at the expense of
performance in uncommon cases (where failures actually happen). I believe this
is the absolutely correct design choice. Optimizations like the more immediate
AE delivery described in the previous section are essential, because they
directly affect the common path.

On the other hand, optimizations like conflict indices for faster backtracking
are, while technically interesting, not really important in practice because
they provide a limited benefit in scenarios that happen during <0.01% of the
lifetime of a typical cluster.


This concludes our series of 4 posts about the Raft distributed consensus
algorithm. Thanks for reading!

For any questions or comments about these posts or the code, please send me an
email or open an issue on GitHub.

If you’re interested in checking out industrial-strength, battle tested
implementations of Raft in Go, I recommend:

  • etcd/raft is the Raft
    part of etcd, which is a distributed key-value database.
  • hashicorp/raft is a standalone Raft
    consensus module that can be tied to different clients.

These implement all the features of the Raft paper, including:

  • Section 6: Cluster membership changes – if one Raft server comes offline
    permanently, it’s useful to be able to replace it with another without
    bringing the whole cluster down.
  • Section 7: Log compaction – in real applications the log grows very large
    and it becomes impractical to fully persist it for every change or fully
    replay it in case of crashes. Log compaction defines a checkpointing mechanism
    that makes it possible for Raft clusters to replicate very large logs.

Louis Vuitton is painting boarded up windows its signature orange shade, revealing a dystopian new reality prior to a historic election


A man walks past a boarded-up window at a Louis Vuitton store in San Francisco, Sunday, Nov. 1, 2020, ahead of Election Day.

  • Louis Vuitton is using plywood and steel that is painted its signature shade of orange to board up stores prior to possible protests, riots, and looting following the election. 
  • The Louis Vuitton-branded boards first made their appearance early in the pandemic. 
  • The dedication to aesthetic sends an unnerving message — that no amount of changes, protests, or riots will even temporarily stop brands like Louis Vuitton from trying to sell things. 

Louis Vuitton is painting boarded up windows its signature shade of orange, as the luxury retailer prepares for possible unrest around the 2020 presidential election. 

Shopping destinations across America are bracing for the election, Business Insider’s Thomas Pallini reports.

Rodeo Drive in Beverly Hills is shutting down completely. In Chicago, the Magnificent Mile will be armed with “everything from snow plows to salt trucks” to control crowds, Rich Gamble, chairman of the Magnificent Mile Association, told Bloomberg. Plywood and steel are in high demand, CBS News reports, as stores board up their windows to protect against riots and looting. 

Read more: Impeachments, intraparty warfare, and a run on antidepressants: Democrats contemplate their ultimate nightmare scenario of Trump winning a 2nd term

Yet, despite these concerns, some brands remain dedicated to their established aesthetic. 

Prime among the luxury brands refusing to allow potential unrest to disrupt their branding is Louis Vuitton. The retailer has been painting steel and plywood barriers its signature shade of orange at stores across the US.

It shows a staunch commitment to the brand. But it can also come across as deeply dystopian.

Income inequality has been front and center this election, but that hasn’t stopped some retailers from doubling down on luxury branding, even though it may seem out of step with the average shoppers’ concerns.

A man walks past a boarded-up window at a Louis Vuitton store in San Francisco, Sunday, Nov. 1, 2020.

With income inequality reaching record highs in recent years, addressing the divisions between the rich and poor in the US emerged as one of the biggest topics of the election.  In February 2019, Insider’s Eliza Relman reported that taxation and inequality were the center of the 2020 race. 

“Class war is the only war that’s necessary and apparently the only one conservatives wouldn’t support waging for two decades without end,” Sean McElwee, a progressive activist and cofounder of Data for Progress, told Relman at the time. 

The pandemic and its “K-shaped” recovery have exacerbated many of these divisions.

A message of support at a Louis Vuitton store during the pandemic, Chicago, Illinois.

Some experts say that the US could see a “K-shaped” recovery from the pandemic, as wealthy professionals recover but those working low-paying jobs do not. Former Vice President Joe Biden has repeatedly referenced the concept while campaigning. 

“Billionaires have made another $300 billion because of his profligate tax proposal, and he only focused on the market,” Biden said in the first presidential debate. “But you folks at home, you folks living in Scranton and Claymont and all the small towns and working class towns in America, how well are you doing?” 

Louis Vuitton first debuted its signature boarded up windows early in the pandemic, when stores closed due to COVID shutdown.

Some high-end stores, like Louis Vuitton on Greene Street, boarded up windows and entrances with plywood in late March.

Louis Vuitton, like many luxury retailers, was forced to close stores around the world because of the pandemic. 

Parent company LVMH reported in October that the fashion brand’s sales have recovered from the early pandemic. The company reported revenues of 11.96 billion euros, equivalent to $13.99 billion, in the third quarter, The Wall Street Journal reported. 

After Louis Vuitton stores reopened, some boarded up windows again during protests over the summer.

A Louis Vuitton storefront remains vandalized after a night of largely peaceful protests descended into chaos and violent confrontations in lower Manhattan on June 1, 2020.

Louis Vuitton faced pushback when the brand did not immediately speak out in the days following the death of George Floyd, as other companies made statements condemning police brutality and white supremacy.

Backlash grew after designer Virgil Abloh publicly criticized people looting shops, followed by a donation of just $50 to a bail fund. (Abloh and Vuitton released a video in support of Black Lives Matter in late May.)

Boarding up windows can be a way to protect merchandise, as well as make workers’ jobs easier and safer.

Gregg Donovan, former Ambassador of Beverly Hills, stands at a closed and boarded-up Via Rodeo before Election Day.

Louis Vuitton is far from the only brand boarding up windows and taking precautions, both during the protests over the summer and before the election. 

Axios reports that protests cost more than $1 billion in damages across the US in a 13-day period from May 26 to June 8. 

However, the dedication to branding — down to the shade of orange — can be unnerving.


Acclimating to the “new normal,” for the sake of executives, investors, and even employees, makes sense. But, it can also feel dystopian, a queasy attempt to force something from another time to function even when things have changed for the worse. 

The orange-tinted barriers feel like a ramped-up version of brands co-opting the hellish nature of 2020 in ad campaigns. It’s two things that feel inherently at odds — 2020 being horrible, the need to advertise — and cobbling them together into a twisted version of normalcy. 

Stores need protection. But, luxury brands rely on their aesthetic to be able to charge $3,000 for a handbag. So, we have orange plywood barriers outside Louis Vuittons across the US. 

In some ways, the barriers are a microcosm of what is wrong in America as the nation heads into the election.

A Louis Vuitton storefront is shown heavily damaged in early June.

There is a reason why populists on both sides of the aisle have found support, even as some of the right argue that Democrats like Bernie Sanders, Elizabeth Warren, and Alexandria Ocasio-Cortez fan the flames of class warfare. Major businesses and wealthy Americans have flourished in recent years, enjoying the success that is at odds with reality for most people in the country. 

Income has basically stagnated for most workers when adjusted for inflation, even though net productivity has grown by 70% since the early ’70s. In the pandemic, private jet travel skyrocketed as millions of people lost their jobs. Small businesses are struggling to survive, but shares of the largest companies in the nation are set to explode. 

Louis Vuitton did not respond to Business Insider’s request for comment on its barriers, which are ultimately pretty harmless. But, they do show just how dystopian this year can be. 

Even as the country prepares for vote-counting delays, riots, and a continuing pandemic, absolutely nothing will stop brands from trying to sell people things in the “new normal.” 

Read the original article on Business Insider

5G, Robotics, AVs, and the Eternal Problem of Latency



Steven Cherry Hi, this is Steven Cherry for Radio Spectrum.

In the winter of 2006 I was in Utah reporting on a high-speed broadband network, fiber-optic all the way to the home. Initial speeds were 100 megabits per second, to rise tenfold within a year.

I remember asking one of the engineers, “That’s a gigabit per second—who needs that?” He told me that some studies had been done in southern California, showing that for an orchestra to rehearse remotely, it would need at least 500 megabits per second to avoid any latency that would throw off the synchronicity of a concert performance. This was fourteen years before the coronavirus would make remote rehearsals a necessity.

You know who else needs ultra-low latency? Autonomous vehicles. Factory robots. Multiplayer games. And that’s today. What about virtual reality, piloting drones, or robotic surgery?

What’s interesting in hindsight about my Utah experience is what I didn’t ask and should have, which was, “So what you really need is low latency, and you’re using high bandwidth as a proxy for that?” We’re so used to adding bandwidth when what we really need is to reduce latency that we don’t even notice we’re doing it. But what if enterprising engineers got to work on latency itself? That’s what today’s episode is all about.

It turns out to be surprisingly hard. In fact, if we want to engineer our networks for low latency, we have to reengineer them entirely, developing new methods for encoding, transmitting, and routing. So says the author of an article in November’s IEEE Spectrum magazine, “Breaking the Latency Barrier.”

Shivendra Panwar is a Professor in the Electrical and Computer Engineering Department at New York University’s Tandon School of Engineering. He is also the Director of the New York State Center for Advanced Technology in Telecommunications and the Faculty Director of its New York City Media Lab. He is also an IEEE Fellow, quote, “For contributions to design and analysis of communication networks.” He joins us by a communications network, specifically Skype.

Steven Cherry Shiv. Welcome to the podcast.

Shivendra Panwar Thank you. Great to be here.

Steven Cherry Shiv, in the interests of disclosure, let me first rather grandly claim to be your colleague, in that I’m an adjunct professor at NYU Tandon, and let me quickly add that I teach journalism and creative writing, not engineering.

You have a striking graph in your article. It suggests that VoIP, FaceTime, Zoom, they all can tolerate up to 150 milliseconds of latency, while for virtual reality it’s about 10 milliseconds and for autonomous vehicles it’s just two. What makes some applications so much more demanding of low latency than others?

Shivendra Panwar So it turns out, and this was actually news to me, is that we think that the human being can react on the order of 100 or 150 milliseconds.

We hear about fighter pilots in the Air Force who react within 100 ms or the enemy gets ahead of them in a dogfight. But it turns out human beings can actually react at even a lower threshold when they are doing other actions, like trying to touch or feel or balance something. And that can get you down to tens of milliseconds. What has happened is in the 1980s, for example, people were concerned about applications like the ones you mentioned, which required 100, 150 ms, like a phone call or a teleconference. And we gradually figured out how to do that over a packet-switched network like the Internet. But it is only recently that we became aware of these other sets of applications, which require an even lower threshold in terms of delay or latency. And this is not even considering machines. So there are certain mechanical operations which require feedback loops of the order of milliseconds or tens of milliseconds.

Steven Cherry Am I right in thinking that we keep throwing bandwidth at the latency problem? And if so, what’s wrong with that strategy?

Shivendra Panwar So that’s a very interesting question. If you think of bandwidth in terms of a pipe. Okay, so this is going back to George W. Bush. If you don’t remember this famous interview or debate he had and he likened the Internet to be a set of pipes and everyone made fun of him. Actually, he was not far off. You can make the analogy that the Internet is a set of pipes. But coming back to your question, if you view the Internet as a pipe, there are two dimensions to a pipe, it’s the diameter of the pipe, how wide it is, how fat it is, and then there’s the length of the pipe. So if you’re trying to pour … if you think of bits as a liquid and you’re trying to pour something through that pipe, the rate at which you’d be able to get it out at the other end or how fast you get it out of the other end depends on two things—the width of the pipe and the length of the pipe. So if you have a very wide pipe, you’ll drain the liquid really fast. So that’s the bandwidth question. And if you shorten the length of the pipe, then it’ll come out faster because it has less length of pipe to traverse. So both are important. So bandwidth certainly helps in reducing latency if you’re trying to download a file, for example, because the pipe width will essentially make sure you can download the file faster.

But it also matters how long the pipe is. What are the fixed delays? What are the variable delays going through the Internet? So both are important.

Steven Cherry You say one big creator of latency is congestion delays. To use the specific metaphor of the article, you describe pouring water into a bucket that has a hole in it. If the flow is too strong, water rises in the bucket and that’s congestion delay, water droplets—the packets, in effect—waiting to get out of the hole. And if the water overflows the bucket, if I understand the metaphor, those packets are just plain lost. So how do we keep the water flowing at one millisecond of latency or less?

Shivendra Panwar So that’s a great question. So if you’re pouring water into this bucket with a hole and you want to keep—first of all, you want to keep it from overflowing. So that was: Don’t put too much water because even in the bucket, the hole will gradually fill up and overflow from the top. But the other and equally important issue is you want to fill the bucket, maybe just the bottom of the bucket. You know, just maybe a little bit over that hole so that the time it takes for water that you are pouring to get out is minimized.

And that’s minimizing the queuing delay, minimizing the congestion, and ultimately minimizing the delay through the network. So if you want it to be less than a millisecond, you have to be very careful pouring water into that bucket, so that it just fills, or uses, the capacity of that hole but does not start filling up the bucket.
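The bucket picture corresponds to a classic result from queuing theory, sketched below: in the simplest (M/M/1) model, average delay is 1/(mu − lam), so it blows up as the pouring rate approaches the hole’s capacity. The rates here are illustrative examples, not from the interview.

```python
# Illustrative sketch: average delay in an M/M/1 queue as utilization
# rises. The "hole" drains at service rate mu; water is poured in at
# arrival rate lam. Average time in system = 1 / (mu - lam) for lam < mu.

def mm1_delay_ms(lam, mu):
    """Average time in system (ms) for arrival rate lam and service
    rate mu, both in packets per millisecond."""
    assert lam < mu, "the bucket overflows: arrivals exceed capacity"
    return 1.0 / (mu - lam)

mu = 10.0  # the hole drains 10 packets per millisecond
for util in (0.1, 0.5, 0.9, 0.99):
    print(f"utilization {util:.2f}: {mm1_delay_ms(util * mu, mu):.2f} ms")
```

Keeping the bucket nearly empty (low utilization) keeps the delay close to the bare service time; pushing the link toward full utilization makes the queuing delay, and hence the latency, explode.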

Steven Cherry Phone calls used to run on a dedicated circuit between the caller and the receiver. Everything runs on TCP now, I guess in hindsight, it’s remarkable that we can even have phone calls and Zoom sessions at all with our voices and video chopped up into packets and sent from hop to hop and reassembled at a destination. At a certain point, the retuning that you’re doing of TCP starts to look more and more like a dedicated circuit, doesn’t it? And how do you balance that against the fundamental point of TCP, which is to keep routers and other midpoints available to packets from other transmissions as well as your own?

Shivendra Panwar So that is the key point you have mentioned here. This was, in fact, a hugely controversial point back in the ’80s and ’90s, when the first experiments to switch voice from circuit-switched networks to packet-switched networks were first considered. And there were many diehards who said you cannot equal the latency and reliability of a circuit-switched network. And to some extent, actually, that’s still right. The quality of a circuit-switched line, by the 1970s and 1980s, when it had reached its peak of development, was excellent.

And sometimes we struggle to get to that quality today. However, the cost issue overrode it. And the fact that you are able to share the network infrastructure with millions and now billions of other people made the change inevitable. Now, having said that, this seems to be a tradeoff between quality and cost, and to some extent it is. But there is, of course, a ceaseless effort to try and improve the quality without giving up anything on the cost. And that’s where the engineering comes in: monitoring what’s happening to your connection on a continuous basis, so that whenever TCP senses that congestion is building up, it backs off, reducing its rate so that it does not contribute to the congestion and its vital bits get through in time.
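That back-off behavior can be sketched as additive-increase/multiplicative-decrease (AIMD), the idea underlying classic TCP congestion control. This is a simplification; real TCP variants such as Reno or CUBIC differ in detail.

```python
# Simplified sketch of TCP-style additive-increase/multiplicative-decrease
# (AIMD): grow the sending rate gently, halve it when congestion is sensed.

def aimd_step(cwnd, congestion_sensed, increase=1.0, decrease=0.5):
    """Return the next congestion window given the current one."""
    if congestion_sensed:
        return max(1.0, cwnd * decrease)  # back off sharply, floor at 1
    return cwnd + increase                # probe gently for spare capacity

cwnd = 10.0
for loss in (False, False, True, False):
    cwnd = aimd_step(cwnd, loss)
print(cwnd)  # 10 -> 11 -> 12 -> 6 -> 7
```

The sharp multiplicative cut is what lets many senders share a link without collapsing it, while the gentle additive probe keeps each sender searching for spare capacity.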

Steven Cherry You make a comparison to shoppers at a grocery store picking which checkout lane has the shorter line or is moving faster. Maybe as a network engineer, you always get it right, but I often pick the wrong lane.

Shivendra Panwar That is true in terms of networking as well, because there are certain things that you cannot predict. There might be two queues in a router, for example, and one may look a lot shorter than yours. But then there is some holdup: a packet may need extra processing, or some other issue may crop up. And so you may end up spending more time waiting in a line that initially appeared short to you. So there is actually a lot of randomness in networking. In fact, a knowledge of probability theory, queuing theory, and all of this probabilistic math is the basis of engineering networks.

Steven Cherry Let’s talk about cellular for a minute. The move to 5G will apparently help us reduce latency by reducing frame durations, but it also potentially opens us up to more latency because of its use of millimeter waves?

Shivendra Panwar That is indeed a tradeoff. The engineers who work at the physical layer have been working very hard to increase the bandwidth, to get us into gigabits per second at this point in 5G, and to reduce the frame lengths, so the time you spend waiting to put your bits onto the channel is reduced. But in this quest for more bandwidth, they moved up the electromagnetic spectrum to millimeter waves, which have a lot more capacity but poorer propagation characteristics. At millimeter wavelengths, the signal can no longer go through the wall of a building, for example, or even the human body or a tree. Imagine yourself, let’s say, in Times Square before COVID, walking with your 5G phone: every passerby, or every truck rolling by, would potentially block the connection between your cell phone and the cell tower. Those interruptions are driven by the physical world. In fact, I joke that this is the case of Sir Isaac Newton meeting Maxwell, the electromagnetic guru. Because those interruptions are essentially uncontrollable, you can get typical interruptions on the order of half a second or a second before you switch to another base station, which is the current technology, and find another way to get your bits through. So those blockages, unfortunately, can easily add a couple of hundred milliseconds of delay, because you may not have an alternate way to get your bits through to the cell tower.

Steven Cherry I guess that’s especially important, not so much for phone conversations and Web use or whatever we’re using our phones for, where, as we said before, a certain amount of latency is not a big problem. But 5G is going to be used for the Internet of Things. And there, there will be applications that require very low latency.

Shivendra Panwar Okay, so there are some relatively straightforward solutions. Many IoT applications need relatively low bandwidth, kilobits per second, which is a very low rate. What you could do is assign those applications to what is called sub-6-gigahertz spectrum, the frequencies we currently use. Those are more reliable in the sense that they penetrate buildings, and they penetrate the human body.

And as long as your base station has decent coverage, you can have more predictable performance. It is only as we move up the frequency spectrum, and we try to send broadband applications, applications that use a gigabit per second or more, while also wanting the reliability and the low latency, that we start running into problems.

Steven Cherry I noticed that, as you alluded to earlier, there are all sorts of applications where we would benefit from very low latency, or maybe can’t even tolerate anything but very low latency. To take just one example, another of our colleagues, a young robotics professor at NYU Tandon, is working on exoskeletons and rehabilitation robots for Parkinson’s patients, to help them control hand tremors. And he and his fellow researchers say, and I’m quoting here, “a lag of nearly 10 or 20 milliseconds can affect effective compensation by the machine and in some cases may even jeopardize safety.”

So are there latency issues even within the bus of an exoskeleton or a prosthetic device that they need to get down to single-digit millisecond latency?

Shivendra Panwar That sounds about right, in terms of 10 to 20 milliseconds or perhaps even less. One solution to that, of course, is to make sure that all of the computational power, and all of the data that you need to transmit, stays on the human subject, the person who’s using the exoskeleton, so that you do not depend on the networking infrastructure. That will work. The problem is that the compute and communications hardware will be heavy, even if we can keep shrinking it thanks to Moore’s Law, and it also drains a lot of battery power. So one approach, if we can get the latency and reliability right, is to offload all of that computation to, let’s say, the nearest base station or a Wi-Fi access point. This reduces the weight you’re carrying around in your exoskeleton and reduces the battery power you need to be able to do this for long periods of time.

Steven Cherry Yeah, something I hadn’t appreciated until your article was, you say that ordinary robots, too, could be lighter, have greater uptime, and might even be cheaper with ultra-low latency.

Shivendra Panwar That’s right. Especially if you think of flying robots. Right? You have UAVs. And there, weight is paramount to keep them up in the air.

Steven Cherry As I understand it, Shiv, there’s a final obstacle, or limit at least, to reducing latency. And that’s the speed of light.

Shivendra Panwar That’s correct. Most of us are aware of the magic number, which is about 300,000 km/s. But that’s in a vacuum, or through free space. In a fiber-optic cable, which is very common these days, that goes down to about 200,000 km/s. We always had to take that into account, but it used not to be a big issue.

But now, if you think about it, if you are trying to aim for a millisecond of delay, say, that is a distance of, quote unquote, only 300 km that light travels in free space, or even less over fiber-optic cable, down to 200 kilometers. That means you cannot do some of the things we’ve been talking about sitting in New York if it happens to be something we are controlling in Seoul, South Korea, right? The speed of light takes a perceptible amount of time to get there. Similarly, the service providers who want to host all these new applications now have to be physically close to you to meet those delay requirements. Earlier, we didn’t consider them very seriously, because there were so many other sources of delay, and delays were on the order of a hundred milliseconds, up to a second, even, if you think further back, so a few extra milliseconds didn’t matter. You could have a server farm in Utah dealing with the entire continental U.S., and that would be sufficient. But that is no longer possible.

And so a new field has come up, edge computing, which takes applications closer to the edge in order to support more of these applications. The other reason to consider edge computing is that you can keep the traffic off the Internet core if you push it closer to the edge. For both those reasons, computation may be coming closer and closer to you, in order to keep the latency down and to reduce costs.
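The back-of-the-envelope distances above follow directly from the speeds quoted; as a sketch, using round numbers and counting one-way propagation only:

```python
# Illustrative arithmetic: how far a signal can travel within a given
# one-way latency budget, at the round-number speeds discussed above.

C_VACUUM_KM_S = 300_000  # speed of light in free space, km/s
C_FIBER_KM_S = 200_000   # roughly two-thirds of that in fiber-optic cable

def max_distance_km(budget_ms, speed_km_s):
    """Farthest a signal can travel one-way within budget_ms milliseconds."""
    return speed_km_s * budget_ms / 1000.0

print(max_distance_km(1.0, C_VACUUM_KM_S))  # 300.0 km in free space
print(max_distance_km(1.0, C_FIBER_KM_S))   # 200.0 km over fiber
```

A 1-millisecond budget therefore confines the far endpoint to a couple of hundred kilometers over fiber, before any queuing or processing delay is counted, which is exactly why servers must move to the edge.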

Steven Cherry Well, Shiv, it seems we have a never-ending ebb and flow between putting more of the computing at the endpoint and then more of it at the center, from the mainframes and dumb terminals of the 1950s, to the networked workstations of the ’80s, to cloud computing today, to putting AI inside IoT nodes tomorrow. But all through it, we always need the network itself to be faster and more reliable. Thanks for the thankless task of worrying about the network in the middle of it all, and for being my guest today.

Shivendra Panwar Thank you. It’s been a pleasure talking to you, Steve.

Steven Cherry We’ve been speaking with IEEE Fellow Shivendra Panwar about his research into ultra-low-latency networking at NYU’s Tandon School of Engineering.

This interview was recorded October 27, 2020. Our thanks to Mike of Gotham Podcast Studio for our audio engineering; our music is by Chad Crouch.

Radio Spectrum is brought to you by IEEE Spectrum, the member magazine of the Institute of Electrical and Electronics Engineers, a professional organization dedicated to advancing technology for the benefit of humanity.

For Radio Spectrum, I’m Steven Cherry.
