org.apache.pig.data
Class DataBag

java.lang.Object
  extended by org.apache.pig.data.Datum
      extended by org.apache.pig.data.DataBag
All Implemented Interfaces:
Comparable, Iterable<Tuple>, Spillable
Direct Known Subclasses:
DefaultDataBag, DistinctDataBag, SortedDataBag

public abstract class DataBag
extends Datum
implements Spillable, Iterable<Tuple>

A collection of Tuples. A DataBag may or may not fit into memory. DataBag implements Spillable, which means that it registers with a memory manager. By default, it attempts to keep all of its contents in memory. If the memory manager asks it to spill to disk (by a call to spill()), it takes whatever it has in memory, opens a spill file, and writes the contents out. This may happen multiple times. The bag tracks all of the files it has spilled to.

DataBag provides an Iterator interface that allows callers to read through the contents. The iterators are aware of data spilling. They have to be able to handle reading from files, as well as the fact that data they were reading from memory may have been spilled to disk underneath them.

The DataBag interface assumes that all data is written before any is read. That is, a DataBag cannot be used as a queue. If data is written after data is read, the results are undefined. This condition is not checked on each add or read, for reasons of speed. Caveat emptor.

Since spills are asynchronous (the memory manager requesting a spill runs in a separate thread), all operations dealing with the mContents Collection (the collection of tuples contained in the bag) have to be synchronized. This means that reading from a DataBag is currently serialized. This is acceptable for the moment because Pig execution is currently single-threaded. A ReadWriteLock was tried, but it was found to be about 10x slower than using the synchronized keyword. If Pig changes its execution model to be multithreaded, this issue may need to be revisited, as synchronizing reads will most likely defeat the purpose of multithreaded execution.

DataBags come in three types: default, sorted, and distinct. The type must be chosen up front; there is no way to convert a bag on the fly.
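As a rough illustration of the spill cycle described above, the following self-contained sketch models a bag that holds strings rather than Tuples. The class SpillingBagSketch and all of its members are hypothetical and are not part of Pig. The real DataBag is driven by Pig's memory manager and may differ in file format, read order, and error handling.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a spillable bag; it holds strings instead of Tuples. */
final class SpillingBagSketch {
    private final List<String> inMemory = new ArrayList<>(); // plays the role of mContents
    private final List<File> spillFiles = new ArrayList<>(); // plays the role of mSpillFiles
    private long size = 0;                                    // plays the role of mSize

    /** Writers and the spilling thread contend for the same lock. */
    synchronized void add(String item) {
        inMemory.add(item);
        size++;
    }

    /** What a memory manager would trigger: move the in-memory contents to a new file. */
    synchronized void spill() {
        if (inMemory.isEmpty()) {
            return;
        }
        try {
            File f = File.createTempFile("bag-sketch", ".spill");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
                for (String s : inMemory) {
                    out.writeUTF(s);
                }
            }
            spillFiles.add(f);   // remember the file so it can be read back later
            inMemory.clear();    // the memory is now free for new tuples
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts elements both on disk and in memory. */
    synchronized long size() {
        return size;
    }

    synchronized int spillFileCount() {
        return spillFiles.size();
    }

    /** Reads everything back: spilled files first, then what is still in memory. */
    synchronized List<String> readAll() {
        List<String> result = new ArrayList<>();
        try {
            for (File f : spillFiles) {
                try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
                    while (true) {
                        try {
                            result.add(in.readUTF());
                        } catch (EOFException endOfFile) {
                            break; // this spill file is exhausted
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        result.addAll(inMemory);
        return result;
    }
}
```

In this sketch, adding "a" and "b", calling spill(), and then adding "c" leaves one spill file on disk while size() still reports 3. Following the write-before-read rule, readAll() is only called once all adds are finished.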


Nested Class Summary
static class DataBag.BagDelimiterTuple
           
static class DataBag.EndBag
           
static class DataBag.StartBag
           
 
Field Summary
static Tuple endBag
           
protected static int MAX_SPILL_FILES
           
protected  Collection<Tuple> mContents
           
protected  long mMemSize
           
protected  boolean mMemSizeChanged
           
protected  long mSize
           
protected  ArrayList<File> mSpillFiles
           
static Tuple startBag
           
 
Fields inherited from class org.apache.pig.data.Datum
ATOM, BAG, MAP, OBJECT_SIZE, RECORD_1, RECORD_2, RECORD_3, REF_SIZE, TUPLE
 
Constructor Summary
DataBag()
           
 
Method Summary
 void add(Tuple t)
          Add a tuple to the bag.
 void addAll(DataBag b)
          Add contents of a bag to the bag.
 int cardinality()
          Deprecated.
 void clear()
          Clear out the contents of the bag, both on disk and in memory.
 int compareTo(Object other)
          This method is potentially very expensive since it may require a sort of the bag; don't call it unless you have to.
 Iterator<Tuple> content()
          Deprecated. 
 boolean equals(Object other)
           
protected  void finalize()
          Need to override finalize to clean out the mSpillFiles array.
 long getMemorySize()
          Return the amount of memory the bag is using.
protected  DataOutputStream getSpillFile()
          Get a file to spill contents to.
 int hashCode()
           
abstract  boolean isDistinct()
          Find out if the bag is distinct.
abstract  boolean isSorted()
          Find out if the bag is sorted.
abstract  Iterator<Tuple> iterator()
          Get an iterator to the bag.
 void markStale(boolean stale)
          This is used by FuncEvalSpec.FakeDataBag.
protected  void reportProgress()
          Report progress to HDFS.
 long size()
          Get the number of elements in the bag, both in memory and on disk.
 String toString()
          Write the bag into a string.
 void write(DataOutput out)
          Write a bag's contents to disk.
 
Methods inherited from class java.lang.Object
clone, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.apache.pig.impl.util.Spillable
spill
 

Field Detail

mContents

protected Collection<Tuple> mContents

mSpillFiles

protected ArrayList<File> mSpillFiles

mSize

protected long mSize

mMemSizeChanged

protected boolean mMemSizeChanged

mMemSize

protected long mMemSize

startBag

public static final Tuple startBag

endBag

public static final Tuple endBag

MAX_SPILL_FILES

protected static final int MAX_SPILL_FILES
See Also:
Constant Field Values
Constructor Detail

DataBag

public DataBag()
Method Detail

size

public long size()
Get the number of elements in the bag, both in memory and on disk.


cardinality

public int cardinality()
Deprecated. Use size() instead.


isSorted

public abstract boolean isSorted()
Find out if the bag is sorted.


isDistinct

public abstract boolean isDistinct()
Find out if the bag is distinct.


iterator

public abstract Iterator<Tuple> iterator()
Get an iterator to the bag. For default and distinct bags, no particular order is guaranteed. For sorted bags the order is guaranteed to be sorted according to the provided comparator.

Specified by:
iterator in interface Iterable<Tuple>
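
The ordering guarantees of iterator() can be pictured with plain Java collections. This is only an analogy: BagOrderSketch and its methods are hypothetical, and nothing here reflects how DefaultDataBag, DistinctDataBag, or SortedDataBag are actually implemented. In particular, whether a sorted bag keeps duplicates is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical analogy for the three bag flavours, using strings instead of Tuples. */
final class BagOrderSketch {
    /** Sorted-bag analogue: iteration follows the supplied comparator. */
    static List<String> sortedView(List<String> items, Comparator<String> cmp) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(cmp);
        return copy;
    }

    /** Distinct-bag analogue: duplicates are dropped and callers must not rely on order. */
    static Set<String> distinctView(List<String> items) {
        return new HashSet<>(items);
    }
}
```

For the input pear, apple, pear, fig, sortedView with natural ordering yields apple, fig, pear, pear, while distinctView holds three elements in an unspecified order.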

content

@Deprecated
public Iterator<Tuple> content()
Deprecated. Use iterator() instead.


add

public void add(Tuple t)
Add a tuple to the bag.

Parameters:
t - tuple to add.

addAll

public void addAll(DataBag b)
Add contents of a bag to the bag.

Parameters:
b - bag to add contents of.

getMemorySize

public long getMemorySize()
Return the amount of memory the bag is using.

Specified by:
getMemorySize in interface Spillable
Specified by:
getMemorySize in class Datum

clear

public void clear()
Clear out the contents of the bag, both on disk and in memory. Any attempts to read after this is called will produce undefined results.


compareTo

public int compareTo(Object other)
This method is potentially very expensive since it may require a sort of the bag; don't call it unless you have to.

Specified by:
compareTo in interface Comparable
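
To see why compareTo can be expensive, consider one plausible way to compare two unordered collections: compare the sizes first, and only if they match sort both sides and compare element by element. The sketch below illustrates that idea with a hypothetical BagCompareSketch class. It is an assumption about the general technique, not a description of Pig's actual comparison logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical comparison of two unordered collections. */
final class BagCompareSketch {
    static <T extends Comparable<T>> int compare(List<T> a, List<T> b) {
        if (a.size() != b.size()) {
            return Integer.compare(a.size(), b.size()); // cheap path: no sort needed
        }
        List<T> sortedA = new ArrayList<>(a);
        List<T> sortedB = new ArrayList<>(b);
        Collections.sort(sortedA); // O(n log n): the expensive part
        Collections.sort(sortedB);
        for (int i = 0; i < sortedA.size(); i++) {
            int c = sortedA.get(i).compareTo(sortedB.get(i));
            if (c != 0) {
                return c; // first differing element decides
            }
        }
        return 0; // same elements, regardless of original order
    }
}
```

With this approach, the collections (3, 1, 2) and (2, 3, 1) compare as equal, but only after both have been copied and sorted. For a bag that has spilled to disk, the equivalent work would be considerably more costly.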

equals

public boolean equals(Object other)
Specified by:
equals in class Datum

write

public void write(DataOutput out)
           throws IOException
Write a bag's contents to disk.

Specified by:
write in class Datum
Parameters:
out - DataOutput to write data to.
Throws:
IOException - (passes it on from underlying calls).

markStale

public void markStale(boolean stale)
This is used by FuncEvalSpec.FakeDataBag.

Parameters:
stale - Set stale state.

toString

public String toString()
Write the bag into a string.

Overrides:
toString in class Object

hashCode

public int hashCode()
Overrides:
hashCode in class Object

finalize

protected void finalize()
Need to override finalize to clean out the mSpillFiles array.

Overrides:
finalize in class Object

getSpillFile

protected DataOutputStream getSpillFile()
                                 throws IOException
Get a file to spill contents to. The file will be registered in the mSpillFiles array.

Returns:
stream to write tuples to.
Throws:
IOException
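
The pattern behind getSpillFile(), together with the cleanup that finalize() is described as performing, might look roughly like the following. SpillFileSketch, its field, and deleteSpillFiles() are hypothetical names chosen for illustration. The use of a temporary file and the file naming are assumptions, since the source does not say where spill files are created.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of handing out spill files and tracking them for cleanup. */
final class SpillFileSketch {
    final List<File> spillFiles = new ArrayList<>(); // plays the role of mSpillFiles

    /** Creates a new file, registers it, and returns a stream to write to. */
    DataOutputStream getSpillFile() throws IOException {
        File f = File.createTempFile("bag-sketch", ".spill"); // assumed location
        f.deleteOnExit();
        spillFiles.add(f);
        return new DataOutputStream(new FileOutputStream(f));
    }

    /** Deletes every registered file; returns how many were actually removed. */
    int deleteSpillFiles() {
        int deleted = 0;
        for (File f : spillFiles) {
            if (f.delete()) {
                deleted++;
            }
        }
        spillFiles.clear();
        return deleted;
    }
}
```

Registering each file at creation time is what makes later cleanup possible: the bag only needs to walk its own list rather than search the file system.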

reportProgress

protected void reportProgress()
Report progress to HDFS.



Copyright © ${year} The Apache Software Foundation