An elephant keeper tells me his HDFS NameNodes are failing over too frequently. He’s concerned about this because it’s a sign of something wrong in the High Availability (HA) cluster.

If you don’t want failover to happen that fast, I told him, you can simply increase the relevant config key to whatever you want. Joking aside, that can help mitigate the problem, but it is a temporary fix that does not address the root cause. There are several common causes of NameNode stalls and failovers, and the first task is to find evidence of them in the NN logs.
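As a rough starting point for that log search, here is a minimal Python sketch that counts lines containing the two marker strings discussed in the list that follows (“JvmPauseMonitor” and “Potential performance problem”). The function names and the idea of passing in a log path are my own illustration, not part of any Hadoop tooling, and the log location depends on your install.

```python
from collections import Counter

# Marker strings taken from the causes discussed in this post.
MARKERS = ("JvmPauseMonitor", "Potential performance problem")


def count_markers(lines, markers=MARKERS):
    """Count how many log lines contain each marker string."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def scan_log(path):
    """Scan one NameNode log file (hypothetical helper; the path is
    whatever your installation uses)."""
    with open(path, errors="replace") as fh:
        return count_markers(fh)
```

Running `scan_log` on the logs of both NameNodes and comparing the counts with the failover times gives a first hint about which of the causes below is worth digging into.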

  1. Garbage Collection Pauses: Look for occurrences of “JvmPauseMonitor” in the NameNode logs. The GC logs of both NNs will also show long pauses (see this article). A pause of just a second or two is not concerning. But if the pauses are long and line up with the failovers, don’t hesitate to check the GC-related settings (refer to this blog post).
  2. Slow Group Mapping Lookups: This happens when a user’s group lookups are slow, for example because of an overloaded LDAP server. When it happens, the group lookup logic logs a message containing “Potential performance problem”. I remember one elephant keeper who changed her config key, and the LDAP server was overwhelmed with requests as a result. Don’t change this setting unless you are very sure what you are doing.
  3. Excessive Logging: The excessive logging problems I have seen make the NN very slow, and sometimes also cause long GC pauses. For this, we can generally:

    • Change BlockStateChange logging from INFO level to WARN level by setting log4j.logger.BlockStateChange=WARN
    • Change SecurityLogger from INFO level to WARN level by setting log4j.category.SecurityLogger=WARN,RFAS
    • Alternatively, you can change the logging level on the fly:
      hadoop daemonlog -setlevel <nn-host-name>:50070 BlockStateChange ERROR
  4. RPC Load: The most likely root cause is high RPC load blocking delivery of the HA health check message. To check this, capture JMX output from both NameNodes. High RPC load can also be confirmed by analysing the HDFS audit logs around the time the failover happens. It is typically caused by _1)_ a particularly heavy job, _2)_ long I/O latency when flushing transactions to disk (or a combination of both factors), or _3)_ slow processing code in the NN (possibly a bug).
    There is quite a lot we can do about this: increase the RPC handler count (it should be set to 20 * log2(cluster size), with an upper limit of 200), enable the NN service port for DNs, enable the Lifeline protocol, enable RPC congestion control, and so on. All of these help the NN scale better. Please see my favorite blog post about this, written by my colleague Arpit at Hortonworks. To confirm these specific issues, running jstack periodically against the NN is very helpful. I believe any HDFS expert will be able to provide at least some pointers once she has the jstack output.
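
The handler count rule of thumb above is easy to work out by hand, but a small Python sketch makes the cap explicit. The function name, the choice to round down, and the `minimum` guard (so that a one-node cluster does not come out as zero) are my own assumptions; only the 20 * log2 rule and the limit of 200 come from the text.

```python
import math


def suggested_handler_count(cluster_size, factor=20, upper_limit=200, minimum=1):
    """Rule of thumb: factor * log2(cluster size), capped at upper_limit.

    Rounding down and the `minimum` floor are assumptions of this sketch.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    raw = int(factor * math.log2(cluster_size))
    return max(minimum, min(upper_limit, raw))
```

For example, a 64-node cluster gives 20 * 6 = 120 handlers, and anything from 1024 nodes upwards hits the cap of 200.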

Anyway, this is a very challenging problem, and the root cause and its solutions can be tricky to pin down. I remember one user who was a big fan of calling getContentSummary on root (“/”) before every job submission. That recursively scans the whole file system tree while holding the lock, so the NN appears unresponsive for a very long time. Another elephant keeper had all I/O on the NameNodes, including logs, OS, ZooKeeper and so on, going to a single disk device (though on multiple partitions).
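
A heavy caller like the one above tends to stand out in the HDFS audit log. Here is a minimal Python sketch that counts (user, command) pairs in a time window around a failover. It assumes each audit line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and carries `ugi=` and `cmd=` fields; check that against your own audit log before relying on it. The function name and the window handling are hypothetical.

```python
import re
from collections import Counter

# Pull the user (ugi) and command (cmd) fields out of an audit line.
FIELD = re.compile(r"\b(ugi|cmd)=(\S+)")


def top_callers(lines, start=None, end=None, n=3):
    """Return the n most frequent (user, command) pairs.

    start/end are optional 'YYYY-MM-DD HH:MM:SS' strings; they are
    compared as text against the first 19 characters of each line.
    """
    counts = Counter()
    for line in lines:
        stamp = line[:19]
        if start is not None and stamp < start:
            continue
        if end is not None and stamp > end:
            continue
        fields = dict(FIELD.findall(line))
        if "ugi" in fields and "cmd" in fields:
            counts[(fields["ugi"], fields["cmd"])] += 1
    return counts.most_common(n)
```

Calling `top_callers(open(audit_log_path), start, end)` with a window of a few minutes before the failover shows whether one user or one operation dominates the load.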