Zebra is an open-source Apache project that provides columnar storage, management of physical storage metadata, and efficient data serialization.

Apache Zebra Wiki

Introduction

Zebra is a storage layer that provides a high-level data access abstraction and a tabular view of data in Hadoop, and can free Pig users from implementing their own data storage and retrieval code. It provides:

columnar storage format for fast data projection
schema language to manage physical storage metadata
CPU/space-efficient data serialization
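
To illustrate why a columnar layout makes projection fast, here is a minimal sketch in Java. It is not Zebra's API or file format: the class name ColumnarSketch and its methods are hypothetical, and the data lives in memory rather than in Hadoop. The point is only that when each column is stored separately, reading a subset of columns never touches the others.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy in-memory column store, not Zebra's API or file format.
class ColumnarSketch {
    // One list of values per column name; index i across all lists forms row i.
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();
    private int rowCount = 0;

    ColumnarSketch(String... columnNames) {
        for (String name : columnNames) {
            columns.put(name, new ArrayList<>());
        }
    }

    // Appends one row; values must be given in the same order as the column names.
    void addRow(Object... values) {
        if (values.length != columns.size()) {
            throw new IllegalArgumentException("expected " + columns.size() + " values");
        }
        int i = 0;
        for (List<Object> column : columns.values()) {
            column.add(values[i++]);
        }
        rowCount++;
    }

    // Projection: returns rows containing only the requested columns.
    // Columns that are not requested are never read.
    List<List<Object>> project(String... wanted) {
        List<List<Object>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            List<Object> row = new ArrayList<>();
            for (String name : wanted) {
                List<Object> column = columns.get(name);
                if (column == null) {
                    throw new IllegalArgumentException("unknown column: " + name);
                }
                row.add(column.get(r));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        ColumnarSketch table = new ColumnarSketch("name", "age", "city");
        table.addRow("alice", 30, "Paris");
        table.addRow("bob", 25, "Oslo");
        // Only the "name" and "city" columns are accessed here.
        System.out.println(table.project("name", "city"));
    }
}
```

In a row-oriented layout the same query would have to read every field of every row and discard the unwanted ones; with on-disk columns the saving is in I/O rather than in memory access.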

In the future, it could also support predicate pushdown for further performance improvement. Zebra is initially released as a contrib project in Pig and may become a Hadoop subproject later on.

Prerequisite

Zebra requires Hadoop 20 (as of July 24th, 2009, with Hadoop patch 6150), which supports TFile, and works with Pig 0.3.0 with patch PIG-660 applied. That patch makes Pig work with Hadoop 20. Zebra itself has been submitted as PIG-833.

Getting Zebra

Zebra has been committed as a Pig contrib project at:

Zebra source code

Compilation prerequisite:

JDK 1.6
Ant 1.7.1
Javacc 4.2

How to compile:

check out the latest Pig trunk
apply the latest patch from PIG-660
copy the hadoop20.jar attached to PIG-833 to Pig's top-level ./lib
run 'ant jar' (generates a Pig binary compatible with Hadoop 20)
run 'ant -Dtestcase=none test-core' (for Zebra tests)
cd contrib/zebra
run 'ant jar'
run 'ant test' (for tests)

The Zebra jar will be generated in the build/contrib/zebra directory.

Running Zebra

Sample MapReduce code and Pig scripts are attached to this wiki.

Javadoc is available at Zebra JavaDoc.

http://wiki.apache.org/pig/zebra