Today Steve, Rajesh and I co-published a blog post at Hortonworks.com about the challenges that the Amazon S3 consistency model poses to Hadoop applications. Many Hadoop applications running in Amazon Web Services have been using S3 as the direct destination of work. Because the API gives S3 the appearance of a filesystem, people try to use it in place of HDFS as the destination of Hive, Spark and MapReduce queries. This appears to work, albeit slowly, but it is insidiously dangerous: S3 is eventually consistent, which does not meet the consistency guarantees a filesystem is expected to provide. In that post, we introduce S3Guard and explain how it helps eliminate this consistency problem.

The full article is at https://hortonworks.com/blog/s3guard-amazon-s3-consistency/ and I don’t want to copy and paste it all here, but I have attached the main diagram so you can take a quick look at the overall architecture.

S3Guard architecture

All this work benefits from the collective efforts of the Apache Hadoop community. Hortonworks and our BFF Cloudera both participated actively in the design and implementation. If you’re interested, you can find the implementation details at HADOOP-13345. On Jun 13 2017, Ram and I gave a talk at DataWorks Summit San Jose, where I talked about S3Guard and its context. Here is the video: