An elephant keeper tells me his HDFS balancer is slow and he can’t sleep well at night. He asks me if I can help speed it up.

OK. By design, the HDFS balancer runs slowly in the background, balancing the whole cluster periodically. It's fine for it to be slow, I tell him, so that it does not affect normal cluster activity. Your users submit jobs, copy data in and out, and operate the cluster for fun, without knowing that a balancer is running in the meantime. So go to sleep and sleep well. Don't worry about the slow balancer.

No, he says, the cluster is very imbalanced - he just got dozens of new DataNode servers this week. He has to do something to make the balancer run faster.

OK, let's revisit the parameters of the balancer and the DataNodes. The first config, dfs.datanode.balance.max.concurrent.moves, limits the maximum number of concurrent block moves a DataNode is allowed to perform for balancing the cluster. Its effective value for the balancer is the minimum of the balancer parameter and the DataNode config. The other DataNode-side config is dfs.datanode.balance.bandwidthPerSec, which caps the bandwidth each DataNode may spend on balancing. Oh… we are getting into configuration details, and details go stale as Apache Hadoop evolves. Why not refer to my favorite articles on this subject:
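
To make the DataNode side concrete, here is what those two settings could look like in hdfs-site.xml (the values below are illustrative examples, not recommendations):

```xml
<!-- hdfs-site.xml on each DataNode; example values only -->
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>10</value>
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>1073741824</value>
</property>
```

Remember both are DataNode configs: changing them in hdfs-site.xml takes effect on the DataNodes, while the balancer can only lower, never raise, the effective concurrent-moves limit.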

Finally, he’s using the following commands to run the balancers:

$ hdfs dfsadmin -setBalancerBandwidth 1073741824
$ nohup hdfs balancer \
-Ddfs.datanode.balance.max.concurrent.moves=10 \
-Ddfs.balancer.dispatcherThreads=1024 \
-Ddfs.balance.bandwidthPerSec=1073741824

OK, it looks good to me. Obviously he will notice high NameNode RPC load while the balancer is running.
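
As a side note, the interplay of these parameters can be sketched with plain shell arithmetic (the values below are hypothetical): the effective per-DataNode limit is the minimum of the balancer's -D parameter and the DataNode's own config, and the bandwidth figure in the commands is simply 1 GiB/s written out in bytes.

```shell
# Hypothetical values: the balancer passes -Ddfs.datanode.balance.max.concurrent.moves=10,
# while the DataNode side is configured with 50; the smaller value wins.
balancer_moves=10
datanode_moves=50
effective=$(( balancer_moves < datanode_moves ? balancer_moves : datanode_moves ))
echo "effective max concurrent moves: $effective"        # prints 10

# 1073741824 in the commands above is just 1 GiB/s expressed in bytes per second.
echo "bandwidth in bytes/s: $(( 1024 * 1024 * 1024 ))"   # prints 1073741824
```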

Soon he finds that this does not make the balancer much faster; it hangs with the following jstack:

"main" #1 prio=5 os_prio=0 tid=0xooxx nid=0xooxx waiting on condition [0xooxx]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981)
        at org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611)
        at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663)
        at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905)

Oh, that is not cool. What about the balancer log? The clue we get is that the balancer Dispatcher logs a lot of warnings saying No mover threads available. Right, the best idea is to restart the balancer. It's OK to kill the balancer at any time and start it again. The balancer's internal state will be reset, and hopefully he won't hit this problem again.

I know restarting is not a complete solution, though it is the best solution in many other cases. After some study, I find that in the community, HDFS-11377 has solved this problem by removing the pending block move from the wait list. Simple and clear. After all, who cares about a few failed moves or DataNodes in a balancer iteration? I backport it to his HDP version and it looks good for a week. He thanks me, while I'd thank the open-source Apache community.
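
Putting the jstack and the fix together, the shape of the hang is roughly this (a pseudocode sketch of my understanding, not the actual Dispatcher code):

```
waitForMoveCompletion():
    while any source still has a pending block move:
        sleep(...)                    # the jstack above is parked here

# If a move can never be dispatched (e.g. "No mover threads available"),
# it stays pending forever and the loop above never exits.
#
# HDFS-11377: when a scheduled move fails or cannot be dispatched,
# drop it from the pending list, so waitForMoveCompletion can finish
# the iteration instead of waiting on a move that will never happen.
```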