

GORA-109: Pig Adapter for Gora


Apache Gora is an Object Datastore Mapper with its own data model. At the same time, Apache Pig has its own data model too. Because of this, an adaptation between the two data models is needed.

The objective of this document is to describe the approach taken to implement a Pig adapter for Gora.

Branch with the source code.

Data models

Gora's data entities are generated from Apache Avro schemas and inherit the datatypes defined in Avro.

Pig has its own data model.

The following table shows the different types and a possible conversion between them.

Avro (Gora) type          Pig type
----------------          --------
Primitive/simple types:
null                      null
int (32-bit)              int (32-bit)
long (64-bit)             long (64-bit)
float (32-bit)            float (32-bit)
double (64-bit)           double (64-bit)
bytes (8-bit)             bytearray
string (unicode)          chararray (string UTF-8)
Complex types:
record                    tuple
map<String, 'b>           map<chararray, 'b>
array                     bag
union [null, type]        [the non-null type]

Since datetime, biginteger and bigdecimal aren't handled by Apache Gora, it isn't possible to persist those types.

For unions, only nullable fields (union: [null, type]) are handled. The fixed type is not handled.

Notice that Gora's records are converted into Pig tuples, and arrays into bags (index order matters). When persisting, these are the types expected when the schemas are checked.
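The conversion table above can be sketched as a small lookup that also applies the nullable-union rule. This is illustrative code only, not the adapter's implementation; the function and dictionary names are assumptions:

```python
# Illustrative sketch of the Avro -> Pig type mapping described above.
# Not the adapter's actual code; names are assumptions.
AVRO_TO_PIG = {
    "null": "null",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "bytes": "bytearray",
    "string": "chararray",
    "record": "tuple",
    "map": "map",
    "array": "bag",
}

def pig_type(avro_type):
    """Resolve an Avro type (possibly a nullable union) to its Pig type."""
    if isinstance(avro_type, list):
        # Only nullable unions [null, type] are handled: take the non-null branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError("only [null, type] unions are supported")
        return pig_type(non_null[0])
    if avro_type not in AVRO_TO_PIG:
        raise ValueError("unsupported Avro type: %s" % avro_type)
    return AVRO_TO_PIG[avro_type]

print(pig_type("string"))           # chararray
print(pig_type(["null", "long"]))   # long
```

Types outside the table (such as datetime or fixed) raise an error, mirroring the restrictions described above.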

Reading from datastores

The GoraStorage storage class is responsible for loading and persisting entities. The simplest syntax to load data is the following:

register gora/*.jar;
webpage = LOAD '.' USING org.apache.gora.pig.GoraStorage('{
  "persistentClass": "admin.WebPage",
  "fields": "baseUrl,status,content"
}') ;

It loads the fields baseUrl, status and content for the WebPage entities (the field list must not contain spaces).

The files gora.properties, gora-xxx-mapping.xml and any support files are provided to the Pig client through the classpath: they must be included inside one of the registered *.jar files.

The complete set of LOAD options allows configuring each storage separately, avoiding the global configuration files when multiple different stores are used:

webpage = LOAD '.' USING org.apache.gora.pig.GoraStorage('{
  "persistentClass": "admin.WebPage",
  "keyClass": "java.lang.String",
  "fields": "*",
  "goraProperties": "",
  "mapping": "",
  "configuration": {}
}') ;

The configuration options are the following:

- persistentClass: full name of the persistent class generated from the Avro schema.
- keyClass: full name of the key class (for example java.lang.String).
- fields: comma-separated list of the fields to load or persist, or "*" for all fields.
- goraProperties: the content of a gora.properties file as a single string, overriding the file from the classpath.
- mapping: the content of a gora-xxx-mapping.xml file as a single string, overriding the file from the classpath.
- configuration: a JSON object with string key-values that is merged into the Hadoop job configuration.

In JSON strings, line feeds must be escaped as \\n.
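One way to avoid escaping line feeds by hand is to build the JSON argument programmatically. The sketch below is illustrative: the class names and field list come from the examples in this document, while the mapping content and variable names are assumptions:

```python
import json

# Build the GoraStorage JSON argument from multi-line content.
# The mapping here is a placeholder, not a working Gora mapping.
mapping = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<gora-odm>\n'
           '</gora-odm>')

storage_arg = json.dumps({
    "persistentClass": "admin.WebPage",
    "keyClass": "java.lang.String",
    "fields": "*",
    "mapping": mapping,
})

# json.dumps escapes the embedded line feeds as \n inside the JSON string,
# producing text safe to paste into the GoraStorage('...') argument.
print('\\n' in storage_arg)   # True
```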

An example of Gora properties value is:


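A minimal illustrative value, assuming the HBase store (the property keys shown are standard Gora configuration keys, not taken from this project), could be:

```
"gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\\ngora.datastore.autocreateschema=true"
```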
An example of mapping is:

"<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?>\\n<gora-odm>\\n<table name=\\"webpage\\">\\n<family name=\\"f\\" maxVersions=\\"1\\"/>\\n</table>\\n<class table=\\"webpage\\" keyClass=\\"java.lang.String\\" name=\\"admin.WebPage\\">\\n<field name=\\"baseUrl\\" family=\\"f\\" qualifier=\\"bas\\"/>\\n<field name=\\"status\\" family=\\"f\\" qualifier=\\"st\\"/>\\n<field name=\\"content\\" family=\\"f\\" qualifier=\\"cnt\\"/>\\n</class>\\n</gora-odm>"

The configuration option is a JSON object with string key-values, like this:

"configuration": {
  "hbase.zookeeper.quorum": "hdp4,hdp1,hdp3",
  "zookeeper.znode.parent": "/hbase-unsecure"
}

Writing to datastores

To write a Pig relation to a datastore, the command is:

STORE webpages INTO '.' USING org.apache.gora.pig.GoraStorage('{
  "persistentClass": "",
  "fields": "",
  "goraProperties": "",
  "mapping": "",
  "configuration": {}
}') ;

All the fields listed in "fields" will be persisted. If a listed field is missing from the relation, the process will fail with an exception. If the element already exists in the datastore, only the listed fields will be updated.

Deleting elements

To delete elements from a datastore, use GoraDeleteStorage. Given a relation of rows with schema (key:chararray), the following deletes all rows with those keys:

STORE webpages INTO '.' USING org.apache.gora.pig.GoraDeleteStorage('{
  "persistentClass": "",
  "goraProperties": "",
  "mapping": "",
  "configuration": {}
}') ;

Implementation details

The storages are instantiated at the frontend (the Pig client) and at the backend (the cluster nodes). On both sides a storage is instantiated several times.

On LOAD operation, the following methods of GoraStorage are called at the frontend:

Calls in order, with the storage instance that receives each call:

1. setLocation() (instance A): Merges the "configuration" options into the job configuration and creates localJobConf. Copies the registered JARs to the distributed cache.
2. getInputFormat() (instance B): Returns a PigGoraInputFormat that overrides GoraInputFormat, allowing the configuration to be set without undesired side effects (avoiding the overwrite of the datastore and the query). The InputSplits are created from this InputFormat, then serialized and sent to the backend.

At the backend the methods called are:

Calls in order, with the storage instance that receives each call:

1. setLocation() (instance C): Merges the "configuration" options into the job configuration and creates localJobConf.
2. prepareToRead() (instance C): GoraStorage receives a PigSplit wrapping one InputSplit deserialized from the frontend.
3. getNext() (instance C): Called repeatedly until all the elements of the split have been read.

DataStore instantiation

When the datastore is instantiated, it is created with a call like:

this.dataStore = DataStoreFactory.getDataStore(keyClass, persistentClass, localJobConf) ;

StorageConfiguration is the class that holds the configuration set in the Storage constructor in the Pig script.


It has not been possible to create JUnit tests for GoraStorage in pseudo-distributed mode, because HBaseTestingUtility seems to launch a Hadoop 1.x cluster. There is a commented-out test class in the tests folder. Help would be appreciated.