GORA-174: GORA compiler does not handle ["string", "null"] unions in the AVRO schema

feb-2013

The aim of the patch is to make the compiler handle Avro unions in schemas, which makes optional fields possible.

The Avro specification defines how unions are serialized: first comes a long holding the index of the union element, then the value itself. In HBase specifically, the value in a column will be one byte holding the index, followed by the serialized bytes of the data.

In an Avro schema, the first element of a union is considered the default type. At the moment this is not implemented (nor planned).

Once this fix is in place, do not modify production schemas to add optional fields, because the existing data will be incompatible. This is a big issue for Nutch, since NUTCH-1477 will modify the webpage schema and break data compatibility, AT LEAST in HBase.

The issue in HBase backend

The HBase backend writes at two levels. In Gora 0.2.1, given an HBase <family:column>, if the data being written is not a record, map or array (that is, it is a basic type), it is written raw. If the value for a <family:column> is composed of two or more nested levels, the data is serialized with Avro so that it fits in one column.

After implementing GORA-174, first-level data will not be written raw. Here is an example. For some column called 'mytext':

 fam:mytext = This is the text

If 'mytext' is made optional by changing its schema to ["null","string"], the encoding would be as follows:

 col_name       content:index+value
 ----------     ---------------------------
 fam:mytext     \x01This is the text
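The layout above can be sketched in a few lines of Python. This is only an illustration of the byte layout described on this page, not Gora code: the function name encode_union_column is hypothetical, the string is assumed to be stored as raw UTF-8, and a null value is assumed to be written as the index byte alone.

```python
def encode_union_column(union_types, value):
    """Build the cell content as one index byte followed by the value bytes.

    Assumptions (not taken from Gora): strings are stored as raw UTF-8,
    and a null value is stored as the index byte with no payload.
    """
    if value is None:
        index = union_types.index("null")
        payload = b""
    else:
        index = union_types.index("string")
        payload = value.encode("utf-8")
    return bytes([index]) + payload


cell = encode_union_column(["null", "string"], "This is the text")
print(cell)  # b'\x01This is the text'
```

With the union declared as ["null","string"], the string branch has index 1, which is why the stored value starts with \x01.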

Incompatibilities

These are the incompatibilities that arise if a new schema with unions is used with old data:

  1. In Gora 0.2.1, first-level data could easily be read by other means, simply as serialized Java objects. After the patch, optional fields lose this property because of the index byte.
  2. A schema with any optional field will be incompatible with legacy data.
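Both incompatibilities can be reproduced with a small sketch. The two reader functions below are hypothetical stand-ins for "a reader that expects the index byte" and "an external reader that expects raw bytes"; they are not Gora APIs, and the ["null","string"] union with raw UTF-8 strings is assumed.

```python
UNION = ["null", "string"]  # assumed schema of the optional field


def read_union_cell(cell):
    """Reader for the new layout: first byte is the union index."""
    index, payload = cell[0], cell[1:]
    if index >= len(UNION):
        raise ValueError(f"invalid union index {index}")
    return None if UNION[index] == "null" else payload.decode("utf-8")


def read_raw_cell(cell):
    """External reader that expects the raw value, as in Gora 0.2.1."""
    return cell.decode("utf-8")


legacy_cell = b"This is the text"   # written without index byte
new_cell = b"\x01This is the text"  # written with index byte

# Incompatibility 1: the raw reader now sees the index byte as data.
print(repr(read_raw_cell(new_cell)))

# Incompatibility 2: the union reader takes 'T' (0x54) as the index.
try:
    read_union_cell(legacy_cell)
except ValueError as err:
    print(err)
```

In this sketch the legacy value happens to fail loudly because 'T' is not a valid index. A legacy value whose first byte is \x00 or \x01 would instead be misread silently.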

Solutions

  1. Create an opt-in configuration option for a DataStore that writes null-plus-one-type unions without the index byte and deletes the column when the value is null, with reads handling both cases properly. This lets other systems read the HBase data directly.
  2. Create a deprecated configuration option that parses schemas ignoring unions and nulls, so that the DataStore behaves like Gora 0.2.1.
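A minimal sketch of how the first solution might behave, using a plain dict in place of an HBase row. Everything here is hypothetical: the function names, the raw_optional flag standing in for the opt-in option, and the assumption that the union is ["null","string"] with raw UTF-8 strings.

```python
def put_optional(row, column, value, raw_optional=True):
    """Write an optional string field into a dict that stands in for a row."""
    if raw_optional:
        if value is None:
            row.pop(column, None)  # null: delete the column
        else:
            row[column] = value.encode("utf-8")  # no index byte
    else:
        # Default layout: index byte of ["null","string"] plus the value.
        row[column] = b"\x00" if value is None else b"\x01" + value.encode("utf-8")


def get_optional(row, column, raw_optional=True):
    """Read the field back; a missing column is treated as null."""
    cell = row.get(column)
    if raw_optional:
        return None if cell is None else cell.decode("utf-8")
    if cell is None or cell[0] == 0:
        return None
    return cell[1:].decode("utf-8")


row = {}
put_optional(row, "fam:mytext", "This is the text")
print(row)  # {'fam:mytext': b'This is the text'}
put_optional(row, "fam:mytext", None)
print(row)  # {}
```

With the option enabled, the stored bytes are the same as in the Gora 0.2.1 example above, so external readers keep working, and null is represented by the absence of the column.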

Further work

It would be desirable not to write fields that hold the default value (and so save one HBase column).