
Development notes (version 0.2.1)

March 2013

This page describes development details to consider for GORA-174.

Gora is an ORM framework with a hugely valuable feature: it can use several different backends transparently. Gora has a common frontend implementing an object model which allows standalone read/write/query operations and supports Hadoop. Each backend is independent from the others, with the only requirement of implementing the frontend interface.

With Gora you don't have to worry about whether your backend allows data nesting, whether your backend is supported by Hadoop, or whether your application is designed to work with one NoSQL database or another.

In my opinion, and only my opinion, Gora should be minimally invasive: the data schema is created and stored outside the backend, so ideally you could access your data without Gora. We will see that to some extent this is hard to achieve (for example with nested records or unions of several types).

What is Avro and why it is used here

Although I don't have much knowledge about Avro, I will just comment that Avro has two useful features used in Gora: its schema language, with which Gora defines the data model (records, maps, arrays, unions, ...), and its serialization mechanism, with which Gora persists data in a compact binary form.

What are UNIONs, and why they should be in Gora

Unions belong to Avro's schema specification. They represent a set of possible types for a field, for example: ['null','string']. The leftmost type ('null' in the example) is the default type.

In Gora 0.2.1 UNIONs are not implemented. Unions make it possible to have:

Optional fields

Gora should not allow a null value for a field not defined as ['null', ...]. In fact, HBaseStore fails when you try to write a null into a field defined only as ['string']. Other backends don't fail in this case, but this is not the desirable behavior.

When writing number-based fields, for example with schema ['int'], the backend writes the default Java value 0, but it should fail.

Nutch seems to be writing null values for ['string'] fields by (my guess) marking the field of the ORM instance as "not dirty", which bypasses the failing code. Maybe there is another reason, but without taking a look at the Nutch source code this is the most reasonable explanation.
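To make the expected behavior concrete, here is a minimal sketch (a hypothetical helper, not actual Gora code) of the kind of check a store could perform before writing a field:

import org.apache.avro.Schema;

public class NullChecks {

  // Returns true if the field schema is "null" itself or a union
  // containing "null", i.e. the field is declared as ['null', ...].
  static boolean acceptsNull(Schema fieldSchema) {
    if (fieldSchema.getType() == Schema.Type.NULL) return true;
    if (fieldSchema.getType() == Schema.Type.UNION) {
      for (Schema branch : fieldSchema.getTypes()) {
        if (branch.getType() == Schema.Type.NULL) return true;
      }
    }
    return false;
  }

  // Hypothetical check a backend could run before persisting a value.
  static void checkValue(String fieldName, Schema fieldSchema, Object value) {
    if (value == null && !acceptsNull(fieldSchema)) {
      throw new IllegalArgumentException("Field '" + fieldName
          + "' is not declared as ['null', ...]: null is not allowed");
    }
  }
}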

Nested records

One of the goals of Gora is to provide a data model. This data model must be independent of the backend and be designed as a trade-off between global functionality and the backends' functionality.

A concrete example is the nesting level of each backend: HBase allows 1-level nesting (family+column), MongoDB allows infinite-level nesting (not sure), and K-V stores allow 0-level nesting. In this case, Gora should allow infinite-level nesting. In version 0.2.1 recursive nesting is not allowed in schemas, but with unions this would be possible.

At this moment, Gora provides CRUD functionality for each backend. Furthermore, it provides MapReduce extensions for using those backends. Features like the powerful column searches found in some backends are still not provided by Gora, since those features differ greatly from one backend to another and are not critical functionality. This is the trade-off: in order to be more general, some backend-specific features must be dropped. Of course, any idea can be discussed.

In order to allow infinite-level nesting in 1-level-nesting backends, there are two approaches:

Serialization based nesting

Complex data types (record, array, ...) get serialized in order to fit into a single field. Serializing data with Avro is handy since we have the schema inside the ORM classes to deserialize the data again.

The drawback is that serialized data cannot be easily read by other applications not based on Gora. Actually, this is not that bad: it is not a bad idea that if you plan to use Gora, you stick to it, but Gora should be prepared to be interconnected with those other applications not based on Gora. For this, maybe the best solution would be to allow defining "raw" fields in the gora-xxx-mapping.xml file.

Serialization is for complex structures, so you would not have them in a "not based on Gora" application, while some specific fields used by the "not based on Gora" application would be treated in a special way.
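As an illustration, this is roughly what serialization-based nesting boils down to. Gora itself serializes with SpecificDatumWriter and its generated classes; the sketch below uses Avro's generic API (Avro 1.5+) instead, so it is self-contained:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SerializeNested {
  public static void main(String[] args) throws Exception {
    // Schema of a nested record; in Gora this schema lives inside the ORM class.
    Schema nested = new Schema.Parser().parse(
        "{\"name\":\"nestedRecord\",\"type\":\"record\",\"namespace\":\"com.test\","
        + "\"fields\":[{\"name\":\"nestedColumnLong\",\"type\":\"long\"}]}");

    GenericRecord record = new GenericData.Record(nested);
    record.put("nestedColumnLong", 42L);

    // Serialize the whole nested record into the bytes of a single field/column.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(nested).write(record, encoder);
    encoder.flush();
    byte[] cell = out.toByteArray(); // value that fits into one backend field
  }
}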

ColumnFamily based nesting

ColumnFamily based nesting, avoiding serialization, is possible. It allows (at first) access to data from "not based on Gora" applications, but in exchange for a much more complex system. I am not much for this option, because you would be using Gora with complex data structures just to make those complex structures accessible from a "not based on Gora" application by means of a complex solution; and if you need those complex structures with more features than Gora provides, maybe Gora is not the right tool and you should consider using the backend directly.

Let's face the implementation. This "ColumnFamily based nesting" is based on the idea of Column Families found in HBase and Cassandra. The idea is to create a new ColumnFamily for each nested field, so that the fields of that nested field become columns. For example, let's look at this schema:

{
  "name": "TestRow", "type": "record", "namespace": "com.test",
  "fields": [
    {"name": "columnLong",     "type": "long"},
    {"name": "unionString",    "type": ["null","string"]},
    {"name": "columnRecord",
       "type":{
         "name":"nestedRecord", "type":"record", "namespace":"com.test",
         "fields": [
           {"name":"nestedColumnLong", "type":"long"}
         ]
       }},
    {"name": "unionRecursive", "type": ["null","TestRow"]},
    {"name": "familyMap",        "type": {"type": "map", "values":"string"}}
  ]
}

The example shows a record with two simple fields, a nested record, a recursive record and a map. Simulating how to save it in a K-F-C-V store (Key-Family-Column-Value), level-0 simple fields would go in one column family. The map would go in another column family. The nested record in the example would go in yet another column family (although a sophisticated system would treat its fields as level-0 fields).
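For example, a possible layout for TestRow could look like this (hypothetical; the family names are illustrative):

key=<row key>  family=level0        column=columnLong        value=<long bytes>
key=<row key>  family=level0        column=unionString       value=<string bytes>
key=<row key>  family=familyMap     column=<each map key>    value=<each map value>
key=<row key>  family=columnRecord  column=nestedColumnLong  value=<long bytes>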

But what if nestedRecord again has nested records? The solution in this case would again be "in another column family". Fair.

And what about the recursive record? This time we have to take into account that the tree branches form an enumerable set, so if we have a structure like:

{
  "name": "TestRow", "type": "record", "namespace": "com.test",
  "fields": [
    {"name": "left", "type": ["null","TestRow"]},
    {"name": "right", "type": ["null","TestRow"]}
}

in order to allow infinite-level nesting it would be mandatory to name Column Families after the enumerated branch paths, something like:

left, right, left.left, left.right, right.left, right.right, ...

In both cases you gain some unexpected features, but accessing complex data from a "not based on Gora" application will be quite a pain (and forget about changing Avro schemas, because it would be a doom). You don't gain much over serializing.

In HBase, Column Families are static from the beginning and creating many of them is discouraged. In Cassandra, due to memory consumption and issues regarding their management, probably no more than a few tens of Column Families should be created.

Summing up, I vote for using serialization based nesting at first.

Gora (in the future) should be prepared to map some data to a column family when desired: for example, if you want some subset of fields to be in another column family because it is unrelated data that does not need to be fetched every time, as sketched below.
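Something along these lines in gora-hbase-mapping.xml would split the fields between two column families (a sketch; the family names and the field assignment are made up for illustration):

<gora-orm>
  <table name="TestRow">
    <family name="common"/>
    <family name="rare"/> <!-- rarely fetched fields -->
  </table>
  <class name="com.test.TestRow" keyClass="java.lang.String" table="TestRow">
    <field name="columnLong"  family="common" qualifier="columnLong"/>
    <field name="unionString" family="rare"   qualifier="unionString"/>
  </class>
</gora-orm>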

Of course, I vote for other types of nested persistence depending on the backend. For example, native nesting for MongoDB.

The problem of data backwards compatibility

See the details shown in the documentation about implementation issues with HBase.

Implementation details in HBase

In Gora 0.2.1, HBaseStore persists objects based on 2 levels of the schema. Given a schema like:

{
  "name": "TestRow", "type": "record", "namespace": "com.test",
  "fields": [
    {"name": "columnString", "type": "string"},
    {"name": "columnRecord",
       "type":{
         "name":"nestedRecord", "type":"record", "namespace":"com.test",
         "fields": [
           {"name":"nestedColumnLong", "type":"long"}
         ]
       }},
    {"name": "familyMap", "type": {"type": "map", "values":"string"}}
  ]
}

Level-0 fields (columnString) are written as bytes using the Java-based encoding. Maps and arrays are written in a separate ColumnFamily. The content of level-1+ fields (subrecords, maps and arrays) is serialized into one Column. Serialization is done with Avro (SpecificDatumWriter) when DataStore#put() is called.
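Reading such a column back is the symmetric operation. Presumably Gora does it with SpecificDatumReader and its generated classes; this sketch uses the generic API so it stands alone:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class DeserializeNested {
  // Decodes the bytes of one Column back into the nested record,
  // using the schema kept inside the ORM class.
  static GenericRecord readCell(Schema nested, byte[] cell) throws Exception {
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(cell, null);
    return new GenericDatumReader<GenericRecord>(nested).read(null, decoder);
  }
}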

Maps and arrays only allow simple types.

Implementation details in Cassandra

CassandraStore uses Hector for accessing the server, but uses it basically for data definition and querying: management of keyspaces, and management of families + supercolumns + columns.

Notice that newer versions of Cassandra deprecate the use of SuperColumns in favor of composite columns.

When DataStore#put() is called, CassandraStore makes a clone of the data. It wrongly creates a new Persistent "by hand" instead of using PersistentBase#clone().

Data gets serialized when DataStore#flush() is called. What happens is that it iterates over all the clones and writes them with a serializer, following a fixed type-to-serializer mapping.

There are two proposed implementations:

First option: serialize Persistent instances with Avro and save them into Cassandra with Hector's ByteBufferSerializer.

Second option: create a Gora RecordSerializer that will manage all serialization transparently (it would be usable by Hector's users too!), handling unions, etc. for level-1 nesting, AND it will not force any change to anything outside gora-cassandra! <-- this, this!
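A minimal sketch of what that second option could look like, assuming Hector's AbstractSerializer as the extension point (the class itself and its constructor are hypothetical, not existing Gora code):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import me.prettyprint.cassandra.serializers.AbstractSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical RecordSerializer: an Avro-backed Hector serializer that
// plain Hector clients could reuse as well.
public class RecordSerializer extends AbstractSerializer<GenericRecord> {

  private final Schema schema;

  public RecordSerializer(Schema schema) {
    this.schema = schema;
  }

  @Override
  public ByteBuffer toByteBuffer(GenericRecord record) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
      encoder.flush();
      return ByteBuffer.wrap(out.toByteArray());
    } catch (Exception e) {
      throw new RuntimeException("Avro serialization failed", e);
    }
  }

  @Override
  public GenericRecord fromByteBuffer(ByteBuffer bytes) {
    try {
      byte[] raw = new byte[bytes.remaining()];
      bytes.duplicate().get(raw);
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(raw, null);
      return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    } catch (Exception e) {
      throw new RuntimeException("Avro deserialization failed", e);
    }
  }
}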

Further work

HBase