GORA-109: Pig Adapter for Gora
Apache Gora is an Object Datastore Mapper with its own data model, and Apache Pig has its own data model as well. Because of this, an adaptation between both data models is needed.
The objective of this document is to describe the approach taken to implement a Pig adapter for Gora.
Data models
Gora's data entities are generated from Apache Avro schemas and inherit the datatypes defined in Avro. Pig, in turn, has its own data model. The following table shows the types on both sides and the possible conversions between them.
| Gora | Pig |
|---|---|
| Primitive/Simple types | |
| null | null |
| boolean | boolean |
| int (32-bit) | int (32-bit) |
| long (64-bit) | long (64-bit) |
| float (32-bit) | float (32-bit) |
| double (64-bit) | double (64-bit) |
| bytes (8-bit) | bytearray |
| string (unicode) | chararray (string UTF-8) |
| - | datetime |
| - | biginteger |
| - | bigdecimal |
| Complex types | |
| record | tuple |
| enum | int |
| array | bag |
| map<String, 'b> | map<chararray, 'b> |
| union | [the non-null type] |
| fixed | - |
Since `datetime`, `biginteger` and `bigdecimal` aren't handled by Apache Gora, it isn't possible to persist those types. For unions, only nullable fields (`union: [null, type]`) are handled. The `fixed` type is not handled.

Notice that Gora records are converted into Pig tuples, and arrays into bags (element order matters). When persisting, these are the types expected when the schemas are checked.
Reading from datastores
The storage `GoraStorage` is responsible for loading and persisting entities. The simplest syntax to load data is the following:
register gora/*.jar;
webpage = LOAD '.' USING org.apache.gora.pig.GoraStorage('{ "persistentClass": "admin.WebPage", "fields": "baseUrl,status,content" }') ;
It loads the fields `baseUrl`, `status` and `content` (the list must not contain spaces, sorry) of the `WebPage` entities.
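To relate this back to the type table above, here is a sketch that also requests a hypothetical `outlinks` field, assumed to be declared in Avro as an array of strings (the Avro types of the other fields are assumptions made for illustration as well):

```
-- Sketch: the Avro types noted below are assumptions, not taken from the real admin.WebPage schema.
webpage = LOAD '.' USING org.apache.gora.pig.GoraStorage('{ "persistentClass": "admin.WebPage", "fields": "baseUrl,status,content,outlinks" }') ;
DESCRIBE webpage ;
-- Expected Pig types, following the conversion table in "Data models":
--   baseUrl  -> chararray  (Avro string)
--   status   -> int        (Avro int)
--   content  -> bytearray  (Avro bytes)
--   outlinks -> bag        (Avro array)
```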
The files `gora.properties`, `gora-xxx-mapping.xml` and any support files are provided to the Pig client through the classpath. They must be included inside one of the registered *.jar files.
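For instance, a script could start like this (the second jar name is hypothetical):

```
-- Register the Gora jars plus a hypothetical jar bundling admin.WebPage together with
-- gora.properties and the gora-xxx-mapping.xml of the store in use.
register gora/*.jar;
register webpage-model.jar;
```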
The complete LOAD syntax allows each storage to be configured independently, avoiding the global configuration files when multiple different stores are used:
webpage = LOAD '.' USING org.apache.gora.pig.GoraStorage('{ "persistentClass": "admin.WebPage", "keyClass": "java.lang.String", "fields": "*", "goraProperties": "", "mapping": "", "configuration": {} }') ;
The configuration options are the following:
- `persistentClass` (mandatory): The full name of the persistent class, including the namespace.
- `keyClass`: The full name of the key class. For now only `java.lang.String` is supported.
- `fields` (mandatory): Comma-separated list of field names (without spaces!) or `"*"` to load all fields.
- `goraProperties`: String with the `gora.properties` configuration. Lines must be separated by `\\n`.
- `mapping`: XML mapping for the entities to load. Lines must be separated by `\\n` and quotes escaped as `\\"`.
- `configuration`: JSON object with a map from keys to values that will be added to the job configuration.

In JSON strings, line feeds must be escaped as `\\n`.
An example of a `goraProperties` value is:
"gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\\ngora.datastore.autocreateschema=true\\ngora.hbasestore.scanner.caching=4"
An example of a `mapping` value is:
"<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?>\\n<gora-odm>\\n<table name=\\"webpage\\">\\n<family name=\\"f\\" maxVersions=\\"1\\"/>\\n</table>\\n<class table=\\"webpage\\" keyClass=\\"java.lang.String\\" name=\\"admin.WebPage\\">\\n<field name=\\"baseUrl\\" family=\\"f\\" qualifier=\\"bas\\"/>\\n<field name=\\"status\\" family=\\"f\\" qualifier=\\"st\\"/>\\n<field name=\\"content\\" family=\\"f\\" qualifier=\\"cnt\\"/>\\n</class>\\n</gora-odm>"
The `configuration` option is a JSON object with string keys and values, like this:
{ "hbase.zookeeper.quorum": "hdp4,hdp1,hdp3", "zookeeper.znode.parent": "/hbase-unsecure" }
Writing to datastores
To write a Pig relation to a datastore, the command is:
STORE webpages INTO '.' USING org.apache.gora.pig.GoraStorage('{ "persistentClass": "", "fields": "", "goraProperties": "", "mapping": "", "configuration": {} }') ;
All the fields listed in "fields" will be persisted. If a listed field is missing from the relation, the process fails with an exception. If the element already exists, only the listed fields are updated.
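For example, a minimal write sketch reusing the `admin.WebPage` entity from the reading examples (the field list here is just an illustration):

```
-- Sketch: persist only baseUrl and status of admin.WebPage; other fields of already
-- existing rows are left untouched. gora.properties and the mapping are assumed to
-- come from the classpath, as in the first LOAD example.
STORE webpages INTO '.' USING org.apache.gora.pig.GoraStorage('{ "persistentClass": "admin.WebPage", "fields": "baseUrl,status" }') ;
```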
Deleting elements
To delete the elements of a collection, use `GoraDeleteStorage`. Given a relation with schema `(key:chararray)`, the following deletes all rows with those keys:
STORE webpages INTO '.' USING org.apache.gora.pig.GoraDeleteStorage('{ "persistentClass": "", "goraProperties": "", "mapping": "", "configuration": {} }') ;
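A full delete flow could look like the sketch below; it assumes, purely for illustration, that `admin.WebPage` rows are keyed by their `baseUrl` and that a `status` of 404 marks the rows to remove:

```
-- Sketch: build a (key:chararray) relation and hand it to GoraDeleteStorage.
-- gora.properties and the mapping are assumed to come from the classpath.
expired  = FILTER webpage BY status == 404 ;
del_keys = FOREACH expired GENERATE baseUrl AS key:chararray ;
STORE del_keys INTO '.' USING org.apache.gora.pig.GoraDeleteStorage('{ "persistentClass": "admin.WebPage" }') ;
```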
Implementation details
The storages are instantiated in the frontend (the Pig client) and in the backend (the cluster nodes). On both sides a storage is instantiated several times.

On a LOAD operation, the following methods of `GoraStorage` are called at the frontend:
| Calls in order | Instance | Details |
|---|---|---|
| constructor() | A | |
| setUDFContextSignature() | A | |
| getSchema() | A | |
| setLocation() | A | Merges the "configuration" options into the job configuration, creates `localJobConf` and copies the registered jars to the distributed cache. |
| constructor() | B | |
| setUDFContextSignature() | B | |
| setLocation() | B | |
| getInputFormat() | B | Returns a `PigGoraInputFormat` that overrides `GoraInputFormat`, allowing the configuration to be set without undesired side effects (it avoids overwriting the datastore and the query). The `InputSplit`s are created from this `InputFormat`, then serialized and sent to the backend. |
At the backend the methods called are:
| Calls in order | Instance | Details |
|---|---|---|
| constructor() | C | |
| setUDFContextSignature() | C | |
| setLocation() | C | Merges the "configuration" options into the job configuration and creates `localJobConf`. |
| getInputFormat() | C | |
| setUDFContextSignature() | C | |
| prepareToRead() | C | `GoraStorage` gets a `PigSplit` wrapping one deserialized `PigGoraInputFormat` coming from the frontend. |
| getNext() | C | Repeated until all elements of the split have been read. |
DataStore instantiation
When the datastore is instantiated, it is created with the following call:
this.dataStore = DataStoreFactory.getDataStore( this.storageConfiguration.getKeyClass(), this.storageConfiguration.getPersistentClass(), this.storageConfiguration.getGoraPropertiesAsProperties(), this.localJobConf ) ;
`StorageConfiguration` is the class that holds the configuration set in the storage constructor in the Pig script:
- The key class comes from the constructor parameters (only `java.lang.String` for now).
- The persistent class comes from the constructor parameters.
- The Gora properties come from the constructor; if empty, they are taken from the classpath by `DataStoreFactory`.
- The mapping defined is added to the Gora properties under the key "gora.mapping". The specific store must take this key into consideration when loading the mapping.
- The configuration from the constructor has already been merged into `localJobConf` at `#setLocation()`.
Testing
It hasn't been possible to create JUnit tests for `GoraStorage` in pseudo-distributed mode because `HBaseTestingUtility` seems to launch a Hadoop 1.x cluster. There is a commented-out test class in the test folders. Help will be appreciated.