An elephant keeper told me that he was trying to copy data from his HDFS cluster to S3 and he saw quite a few FileNotFoundExceptions. However, when he checked the failing files immediately afterwards in the Amazon S3 web console, he could see them in the S3 bucket. I then kindly asked him one question: did you use the -p option on your DistCp command line? He said yes, because he did not want to lose the file metadata, and he thought it was good practice to preserve file attributes when copying files.

Actually, when backing up data to S3, the -p options are ignored, including those that preserve permissions, user and group information, checksums, and replication. So, per the best doc I have read on this topic, I suggested the elephant keeper remove the -p option, since it is “useless” anyway.
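
To make this concrete, here is roughly what the two command lines look like (the NameNode address, paths, and bucket name below are made up for illustration):

    # with -p: the preserve options are ignored against S3, and they trigger the problem described below
    hadoop distcp -p hdfs://nn:8020/data s3a://my-bucket/backup/data

    # without -p: the same copy, minus the attempt to preserve file attributes
    hadoop distcp hdfs://nn:8020/data s3a://my-bucket/backup/data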

He kindly asked: being useless does not imply being harmful, right? So what the heck is this FNFE (FileNotFoundException) about? Well, if we run DistCp with the -p option, then after copying each file DistCp calls getFileStatus on the destination to fetch its FileStatus, so that it can compare it with the source and update the metadata if necessary. Here comes the interesting part: Amazon S3 has an eventual consistency model, which can cause the getFileStatus call for a newly created file in S3 to fail with a FileNotFoundException. When this happens (not frequently, of course), the whole MapReduce task fails, retries, and copies all of its data over again.
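
If you want a feel for how this bites, the following shell-level sketch only mimics the pattern (it is not DistCp’s actual code path, and the bucket and file names are hypothetical): write an object to S3, then immediately ask for its status. The -stat call goes through getFileStatus underneath, and under S3’s eventual consistency it can occasionally fail with a FileNotFoundException even though the object shows up in the web console a moment later.

    # roughly what a DistCp map task does: write the file to the S3 destination
    hadoop fs -put part-00000 s3a://my-bucket/backup/data/part-00000

    # roughly what the -p post-processing then does: read the destination status back;
    # under eventual consistency this can intermittently fail with FileNotFoundException
    hadoop fs -stat "%F owned by %u:%g" s3a://my-bucket/backup/data/part-00000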

He kindly asked again: do I have your word that removing the -p option will avoid that getFileStatus call after copying the data? That reminded me of something. Before HADOOP-13145 was committed, getFileStatus was called even when the DistCp command was run without the -p option, in which case the extra call was both wasteful and harmful. So I checked his HDP version, and fortunately he will be fine. If you maintain your own Hadoop version and release, you are probably based on Apache Hadoop 2.7.3, which unfortunately does not have this fix. Now that Hadoop 2.8 is out, you can certainly give it a try.
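
If you are not sure where your own build stands, checking is cheap. On a release without HADOOP-13145 (anything based on 2.7.3), the extra getFileStatus happens even without -p; on the 2.8 line, dropping -p really does skip it. The paths below are made up again:

    # check which Apache Hadoop release your distribution is based on
    hadoop version

    # on a build that includes HADOOP-13145, this copy issues no post-copy getFileStatus against S3
    hadoop distcp hdfs://nn:8020/data s3a://my-bucket/backup/data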

I know this is not a complete solution, though, because many other file system operations are still affected by S3’s eventual consistency. For that, please have a look at HADOOP-13345 in the Apache community.