One elephant keeper told me he needed to move 7PB of data into archival storage and had run into performance issues with the Mover tool. It was a fair question in general, because he was using tiered storage with the standard policies: HOT, WARM, COLD, All_SSD, One_SSD, Lazy_Persist. I made the old balancer joke again: your Mover is supposed to run slowly in the background, so take it easy. Per the doc:

It periodically scans the files in HDFS to check if the block placement satisfies the storage policy. For the blocks violating the storage policy, it moves the replicas to a different storage type in order to fulfill the storage policy requirement. Note that it always tries to move block replicas within the same node whenever possible.

Haha, he agreed it was a nice one. He was not enjoying the humor much, though, because he had to move the backup data to archival storage ASAP so that the SSDs could be freed up for production. It took around 10 minutes to move a 40GB file; at that rate (a bit under 6TB a day), moving 7PB would take roughly 3 years. Oh man. Am I doing the math wrong (again)? Talk is cheap, show me the commands.

hdfs mover -Ddfs.datanode.balance.bandwidthPerSec=134217728 \
-threshold 5.0 \
-p /backup

Looking at those parameters, I said:

  • dfs.datanode.balance.bandwidthPerSec is a DataNode-side configuration, so the value should be changed on the DataNodes and removed from the Mover command.
  • Mover does not accept the -threshold ${THRESHOLD} option (that one belongs to the Balancer), so it should be removed from the Mover command as well.
  • Meanwhile, we should test with some sample data before going after the full 7PB (a corrected run is sketched after this list).
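
One way to do both (a hedged sketch: hdfs dfsadmin -setBalancerBandwidth pushes the per-DataNode limit out at runtime, reusing the 128MB/s figure from his original command; alternatively, dfs.datanode.balance.bandwidthPerSec can be set in the DataNodes' hdfs-site.xml):

# push the balancing/moving bandwidth limit to all live DataNodes (bytes per second)
hdfs dfsadmin -setBalancerBandwidth 134217728

# run the Mover with only the options it actually supports
hdfs mover -p /backup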

After changing the configuration and re-running, the elephant keeper was still not happy, though the Mover did run slightly faster. We then turned to the logs. Usually I'd suggest exporting HADOOP_ROOT_LOGGER=DEBUG,console before running the Mover again, for example:
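
Something along these lines (a minimal sketch; the redirection is just so the dispatcher output lands in m.log, the file the grep below runs against):

export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs mover -p /backup > m.log 2>&1

Grepping the captured log for the chosen ARCHIVE targets showed only three distinct DataNodes: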

egrep -o '([0-9]{1,3}\.){3}[0-9]{1,3}:50010:ARCHIVE' m.log | sort | uniq
192.168.54.40:50010:ARCHIVE
192.168.54.50:50010:ARCHIVE
192.168.54.60:50010:ARCHIVE

Though the customer had 20 ARCHIVE DataNodes, the Mover was using only a few of them. In particular, the host 192.168.54.40 was being hit heavily. This had to be the root cause of the slow Mover run, so I provided some early input while I reviewed the log again:

  1. Double-check that all 20 ARCHIVE DataNodes are correctly configured
  2. To make the problem more representative, move larger files
  3. Print the cluster topology with hdfs dfsadmin -printTopology and make sure it is correct
  4. Collect the NameNode log and JMX metrics, if possible, while running the Mover again (example commands for steps 3 and 4 are sketched after this list)
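
Something like the following for steps 3 and 4 (assuming the default Hadoop 2.x NameNode HTTP port 50070 and a placeholder <namenode-host>, neither of which came from the original conversation):

# step 3: dump the rack/topology mapping the NameNode is using
hdfs dfsadmin -printTopology

# step 4: snapshot NameNode JMX while the Mover is running
curl -s http://<namenode-host>:50070/jmx > nn-jmx.json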

When the elephant keeper got back, we had ruled out all the other possibilities. In the meantime, I had a nice discussion with my colleague Nicholas, and we found some interesting behavior in the Mover source code. org.apache.hadoop.hdfs.server.mover.Mover$Processor::chooseTarget() always chooses the first matching target DataNode from the candidate list. This can make the Mover schedule a lot of move tasks onto a few selected DataNodes (the first several in the candidate list), and overall performance suffers significantly because the network and disks on those DataNodes become saturated. In particular, even if dfs.datanode.balance.max.concurrent.moves is given a larger value, the scheduled move tasks simply queue up on a few storage groups, regardless of the other available storage groups.

The code is as follows:

boolean chooseTarget(DBlock db, Source source,
    List<StorageType> targetTypes, Matcher matcher) {
  final NetworkTopology cluster = dispatcher.getCluster();
  for (StorageType t : targetTypes) {
    // iterates the candidate storage groups in a fixed order ...
    for (StorageGroup target : storages.getTargetStorages(t)) {
      if (matcher.match(cluster, source.getDatanodeInfo(),
          target.getDatanodeInfo())) {
        final PendingMove pm = source.addPendingMove(db, target);
        if (pm != null) {
          // ... and returns on the first match, so the first few targets
          // in the list soak up nearly all of the scheduled moves
          dispatcher.executePendingMove(pm);
          return true;
        }
      }
    }
  }
  return false;
}

We now need an algorithm that distributes the move tasks approximately evenly across all the candidate target storage groups. To achieve this, we can pick a random matching storage group for the given storage type; one implementation is to shuffle the candidate target storages before iterating them (a simplified sketch of the idea follows the list). To make my elephant keeper happy, I attached a source code patch which:

  • shuffles the candidate ARCHIVE DataNode list when choosing target DataNodes for the Mover
  • dumps detailed information from the Mover when a block move is scheduled
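
This is not the actual patch (see HDFS-10335 for that); it is just a minimal, self-contained Java sketch of the shuffle-before-iterate idea, with plain strings standing in for Hadoop's StorageGroup and Matcher machinery:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simplified model of the fix: instead of always returning the first matching
// target, shuffle a copy of the candidate list so scheduled moves spread
// roughly evenly across all matching targets.
public class ShuffledTargetSketch {

  static String chooseTarget(List<String> candidates, Random rand) {
    List<String> shuffled = new ArrayList<>(candidates);
    Collections.shuffle(shuffled, rand); // the only real change vs. first-match
    for (String target : shuffled) {
      // In the real Mover, matcher.match(...) and source.addPendingMove(...)
      // decide here whether a target is usable; the sketch accepts the first one.
      return target;
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> archiveNodes = List.of(
        "192.168.54.40:50010", "192.168.54.50:50010", "192.168.54.60:50010");
    Map<String, Integer> scheduled = new HashMap<>();
    Random rand = new Random();
    for (int i = 0; i < 9000; i++) {
      scheduled.merge(chooseTarget(archiveNodes, rand), 1, Integer::sum);
    }
    // Without the shuffle, every move would be scheduled on 192.168.54.40;
    // with it, each ARCHIVE node gets roughly a third of the 9000 moves.
    System.out.println(scheduled);
  }
}

The actual patch applies the same shuffle to the real candidate storage-group list when choosing targets, leaving the matching logic itself alone.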

After the patch was applied, it took less than 10 minutes to move 800GB of data. Then I suggested tuning dfs.mover.moverThreads and dfs.datanode.balance.max.concurrent.moves to speed the Mover up further. Again, at Hortonworks we are 100% open source; I have always appreciated that spirit, so I filed JIRA HDFS-10335 in the Apache Hadoop community along with a patch. If you're using a Hadoop release before 2.7.3, you will have to backport it in order to make the Mover really fast.
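
For the two tuning knobs mentioned above, a hedged example (the values are purely illustrative; note that dfs.datanode.balance.max.concurrent.moves is also enforced on the DataNode side, so raising it only on the Mover command line may not be enough on its own):

# illustrative values only -- the right numbers depend on cluster size and hardware
hdfs mover -Ddfs.mover.moverThreads=2000 \
    -Ddfs.datanode.balance.max.concurrent.moves=50 \
    -p /backup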