Development notes (version 0.2.1)
This page describes development details to consider for GORA-174.
Gora is an ORM framework with one hugely valuable feature: it can use several different backends transparently. Gora has a common frontend implementing an object model which allows standalone read/write/query and supports Hadoop. Each backend is independent from the others, with the only requirement of implementing the frontend interface.
With Gora you don't have to worry about whether your backend allows data nesting, whether your backend is supported by Hadoop, or whether your application is designed to work with one NoSQL database or another.
In my opinion, and only my opinion, Gora should be minimally invasive: the data schema is created and stored outside the backend, so ideally you could access your data without Gora. We will see that this is hard to achieve to some extent (as with nested records or unions of several types).
What Avro is and why it is used here
Although I don't have much knowledge about Avro, I will just note that Avro has two useful features used in Gora:
- You can define data structures. This is used in the frontend to define the data types.
- You can serialize data (based on the previous data structures). This is used optionally in some backends to serialize when needed. For example, in HBase it is used when a column holds a record; the whole record gets serialized into the column.
What UNIONs are, and why they should be in Gora
Unions belong to Avro's schema specification. They represent a set of possible types for a field, for example: ['null','string']. The leftmost type (in the example, 'null') is the default type.
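For illustration, a minimal record schema (hypothetical, not from the Gora codebase) using a union to make a field optional could look like this; note that in Avro the default value must match the first branch of the union:

```json
{
  "name": "Example", "type": "record", "namespace": "com.test",
  "fields": [
    {"name": "optionalName", "type": ["null", "string"], "default": null}
  ]
}
```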
In Gora 0.2.1 UNIONs are not implemented. Unions make it possible to have:
- Optional fields
- Nested records (infinite levels)
- Variable types
Optional fields
Gora should not allow a null value for a field not defined as ['null', ...]. In fact, HbaseStore fails when you try to write a null in a field defined only as ['string']. Other backends don't fail in this case, but this is not the desirable behavior.
When writing number-based fields, for example with schema ['integer'], the backend writes the default Java value 0 but should fail. Nutch seems to be writing null values for ['string'] fields by (my guess) setting the field of the ORM instance as "not dirty", which bypasses the failing code. Maybe there is another reason, but without taking a look at the Nutch source code this is the most reasonable explanation.
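The desired failing behavior can be sketched with a small validation helper (hypothetical code, not the actual Gora API), assuming the writer knows the union of allowed types for each field:

```python
def validate_write(field_schema, value):
    """Reject writes that the field's (union) schema does not allow.

    field_schema is a list of Avro type names, e.g. ["null", "string"]
    for an optional field or ["string"] for a mandatory one.
    """
    if value is None and "null" not in field_schema:
        # A field not declared as ["null", ...] must not accept null.
        raise ValueError("null not allowed for schema %s" % field_schema)
    return True

# An optional field accepts null; a mandatory one must fail instead of
# silently writing a default value.
assert validate_write(["null", "string"], None)
try:
    validate_write(["string"], None)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```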
Nested records
One of the goals of Gora is to provide a data model. This data model must be independent of the backend and be designed to be a trade-off between a global functionality and backends' functionality.
A concrete example is the nesting level of each backend: HBase allows 1-level nesting (family+column), MongoDB infinite-level nesting (not sure), and K-V stores 0-level nesting. In this case, Gora should allow infinite-level nesting. In version 0.2.1 recursive nesting is not allowed in schemas, but with unions it would be possible.
At this moment, Gora provides CRUD functionality for each backend. Furthermore, it provides MapReduce extensions for using those backends. Features like the powerful column searches found in some backends are still not provided by Gora, since those features differ greatly from one backend to another and are not critical functionality. This is the trade-off: in order to be more general, some backend-specific features must be dropped. Of course, any idea can be discussed.
In order to allow infinite-level nesting in 1-level-nesting backends, there are two approaches:
Serialization based nesting
Complex data types (record, array, ...) get serialized in order to fit in a single field. Serializing data with Avro is handy, since we have the schema inside the ORM classes to deserialize the data again.
The drawback is that serialized data cannot be easily read by other applications not based on Gora. Actually, this is not that bad: it is not a bad idea that if you plan to use Gora, you stick to it, but Gora should be prepared to interconnect with applications not based on Gora. For this, the best solution maybe would be to allow defining "raw" fields in the gora-xxx-mapping.xml file.
Serialization is for complex structures, so you would not have them in a "not based on Gora" application, while the specific fields used by such an application would be treated in a special way.
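A minimal sketch of the idea, using Python's json module as a stand-in for Avro binary serialization (table, family and field names are made up): the nested record is flattened into one opaque value so it fits a single column of a 1-level-nesting backend.

```python
import json

# A record with one level of nesting, as a plain dict.
row = {
    "columnLong": 42,
    "columnRecord": {"nestedColumnLong": 7},  # nested record
}

# Serialize the nested record so it fits in a single column value.
# (Gora would use Avro binary encoding here; json is just a stand-in.)
cell_bytes = json.dumps(row["columnRecord"]).encode("utf-8")

# The backend only ever sees family -> qualifier -> opaque bytes.
stored = {"TestRow": {"columnRecord": cell_bytes}}

# Reading back: deserialize using the schema known to the ORM class.
restored = json.loads(stored["TestRow"]["columnRecord"].decode("utf-8"))
assert restored == {"nestedColumnLong": 7}
```

The point of the sketch is the drawback discussed above: an application without the schema (or without Gora) only sees opaque bytes in that column.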
ColumnFamily based nesting
ColumnFamily based nesting, avoiding serialization, is possible. It allows (at first) access to data from applications not based on Gora, but in exchange for a much more complex system. I am not much for this option, because you would be using Gora with complex data structures just to make those structures accessible from a "not based on Gora" application through a complex solution, and if you need those complex structures with more features than Gora provides, maybe:
- it makes no sense, since you want complex structures in a "simple backend" but do not want to stick to Gora. Accessing those complex structures from outside will be a pain, and you don't get more than with serialization.
- you should change your backend or not use Gora, because it does not fit your needs, since you actually want to access the complex structure's data (remember, formerly not provided by your backend).
Let's face the implementation. This "ColumnFamily based nesting" is based on the idea of Column Families found in HBase and Cassandra. The idea is to create a new ColumnFamily for each nested field, and the fields of that nested record will be columns. For example, let's see this schema:
{
  "name": "TestRow",
  "type": "record",
  "namespace": "com.test",
  "fields": [
    {"name": "columnLong", "type": "long"},
    {"name": "unionString", "type": ["null","string"]},
    {"name": "columnRecord", "type": {
      "name": "nestedRecord",
      "type": "record",
      "namespace": "com.test",
      "fields": [
        {"name": "nestedColumnLong", "type": "long"}
      ]
    }},
    {"name": "unionRecursive", "type": ["null","TestRow"]},
    {"name": "familyMap", "type": {"type": "map", "values": "string"}}
  ]
}
The example shows a record with two simple fields, a nested record, a recursive record and a map. Simulating how to save it in a K-F-C-V store (Key-Family-Column-Value), level-0 simple fields will go in one column family. The map will go in another column family. The nested record in the example will go in yet another column family (although a sophisticated system would treat its fields as level-0 fields).
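The layout just described could be sketched as follows (family names and the placeholder values are illustrative, not actual Gora conventions):

```python
# Hypothetical K-F-C-V layout for one TestRow key.
# family -> column -> value (values shown as placeholder bytes).
layout = {
    "main": {                 # level-0 simple fields share one family
        "columnLong": b"\x00" * 8,
        "unionString": b"hello",
    },
    "familyMap": {            # the map gets its own family: one column per key
        "someKey": b"someValue",
    },
    "columnRecord": {         # the nested record gets its own family:
        "nestedColumnLong": b"\x00" * 8,  # one column per record field
    },
}

# Three column families are needed for this schema.
assert set(layout) == {"main", "familyMap", "columnRecord"}
```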
But what if nestedRecord again has nested records? The solution in this case will be "another column family". Fair.
And what about the recursive record? This time we have to take into account that the tree branches are an enumerable set, so if we have a structure like:
{
  "name": "TestRow",
  "type": "record",
  "namespace": "com.test",
  "fields": [
    {"name": "left", "type": ["null","TestRow"]},
    {"name": "right", "type": ["null","TestRow"]}
  ]
}
then in order to allow infinite-level nesting it will be mandatory to name Column Families something like:
- enumerable-based: ColumnFamily name "TestRow-14" = the 14th node in a full tree, 0 being the first left. So 14 is: left-left-left-left. Good for level iteration if you access the data with a "not based on Gora" application.
- path-based: ColumnFamily name "TestRow-right-left" tells by itself the exact node. Good for depth iteration if you sort ColumnFamily names when accessing the data with a "not based on Gora" application.
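Both naming schemes can be sketched as small functions from a left/right path to a ColumnFamily name. The numbering assumption (level order over the descendants of the root, the root's left child being node 0) is my reading of the "TestRow-14" example:

```python
def enumerable_name(record, path):
    """'TestRow-14' style: level-order index over a full binary tree,
    counting the root's left child as 0 (the root itself is not numbered)."""
    depth = len(path)
    # 2**depth - 2 nodes sit in the levels above this one; within the
    # level, read the path as a binary number with left=0, right=1.
    offset = int("".join("0" if p == "left" else "1" for p in path), 2)
    return "%s-%d" % (record, (2 ** depth - 2) + offset)

def path_name(record, path):
    """'TestRow-right-left' style: the path itself names the node."""
    return "-".join([record] + list(path))

# left-left-left-left is the 14th node, matching the example above.
assert enumerable_name("TestRow", ["left"] * 4) == "TestRow-14"
assert path_name("TestRow", ["right", "left"]) == "TestRow-right-left"
```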
In both cases you gain some unexpected features, but accessing complex data from a "not based on Gora" application will be quite a pain (and forget about changing Avro schemas, because that will be a doom). You don't get much more than with serialization.
In HBase, Column Families are static from the beginning and creating many of them is discouraged. In Cassandra, due to memory consumption and issues regarding their management, probably no more than a few tens of Column Families should be created.
Summing up, I vote for using serialization-based nesting at first.
Gora (in the future) should be prepared to map some data to a column family when desired: for example, if you want some subset of fields to be in another column family because it is unrelated data that need not always be fetched.
Of course, I vote for other types of nested persistence depending on the backend. For example, native nesting for MongoDB.
The problem of backward data compatibility
See the details shown in the documentation about implementation issues with HBase.
Implementation details in HBase
In Gora 0.2.1, HBase persists objects based on two levels of the schema. Given a schema like:
{
  "name": "TestRow",
  "type": "record",
  "namespace": "com.test",
  "fields": [
    {"name": "columnString", "type": "string"},
    {"name": "columnRecord", "type": {
      "name": "nestedRecord",
      "type": "record",
      "namespace": "com.test",
      "fields": [
        {"name": "nestedColumnLong", "type": "long"}
      ]
    }},
    {"name": "familyMap", "type": {"type": "map", "values": "string"}}
  ]
}
level-0 simple fields (columnString) are written as bytes based on Java. Maps and arrays are written in a separate ColumnFamily. The content of level-1+ fields (subrecords, maps and arrays) is serialized into one Column. Serialization is done with Avro (SpecificDatumWriter) when DataStore#put() is called.
Maps and arrays only allow simple types.
Implementation details in Cassandra
CassandraStore uses Hector for accessing the server, but uses it basically for data definition and query: management of keyspaces, and management of families + supercolumns + columns.
Notice that newer versions of Cassandra deprecate the use of Supercolumns in favor of composite columns.
When DataStore#put() is called, CassandraStore makes a clone of the data. It wrongly creates a new Persistent "by hand" instead of using PersistentBase#clone().
Data gets serialized when DataStore#flush() is called. What happens is that it iterates over all clones and writes them with a serializer, following this mapping:
- Simple types (level-0): add a column with the value serialized as follows:
- Utf8 => Gora's Utf8Serializer that encapsulates Cassandra's StringSerializer
- Boolean, ByteBuffer, Double, Float, ... => Cassandra's *Serializer (BooleanSerializer, ByteBufferSerializer, DoubleSerializer, ...)
- Fixed => Gora's SpecificFixedSerializer.
- Array => Gora's GenericArraySerializer.
- Map => Gora's StatefulHashMapSerializer.
- Complex types at level-0:
- Records: creates a Supercolumn (discouraged) and inserts a column for each field. Arrays and Maps get serialized like the simple types above. Does not allow subrecords!
- Maps: creates a Supercolumn (discouraged) and inserts a column for each entry. Allows any type as value except Records.
- Arrays: creates a Supercolumn (discouraged) and inserts a column for each element, with an incremental number as the column name. Allows any type as value except Records.
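The simple-type mapping above amounts to a dispatch table from value type to serializer class; a minimal sketch (the class names come from the list above, the lookup function itself is hypothetical):

```python
# Gora/Avro value type -> serializer class used by CassandraStore,
# as described in the mapping above (simplified to a flat table).
SERIALIZERS = {
    "Utf8": "Utf8Serializer",            # Gora's, wraps StringSerializer
    "Boolean": "BooleanSerializer",      # Cassandra's
    "ByteBuffer": "ByteBufferSerializer",
    "Double": "DoubleSerializer",
    "Float": "FloatSerializer",
    "Fixed": "SpecificFixedSerializer",  # Gora's
    "Array": "GenericArraySerializer",   # Gora's
    "Map": "StatefulHashMapSerializer",  # Gora's
}

def serializer_for(type_name):
    # Each level-0 type maps 1:1 to a serializer class name.
    return SERIALIZERS[type_name]

assert serializer_for("Utf8") == "Utf8Serializer"
assert serializer_for("Map") == "StatefulHashMapSerializer"
```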
There are two proposed implementations:
First option: serialize Persistent instances with Avro and save them into Cassandra with Cassandra's ByteBufferSerializer.
Second option: create a Gora RecordSerializer that will manage all serialization transparently (it would be usable by Hector's users too!), handling unions, etc., for level-1 nesting, AND it will not force any change to anything outside gora-cassandra! <-- this, this!
Further work
- Analyze for each backend which mappings are allowed (and give as much flexibility as possible).
HBase
- Can't map Maps to Columns (only to Families)
- Can't have Maps of any type (only basic types)