java.lang.Object
  org.apache.pig.data.Datum
      org.apache.pig.data.DataBag
public abstract class DataBag
A collection of Tuples. A DataBag may or may not fit into memory. DataBag implements Spillable, which means that it registers with a memory manager. By default, it attempts to keep all of its contents in memory. If the memory manager asks it to spill to disk (by a call to spill()), it takes whatever it has in memory, opens a spill file, and writes the contents out. This may happen multiple times. The bag tracks all of the files it has spilled to.

DataBag provides an Iterator interface that allows callers to read through the contents. The iterators are aware of data spilling. They must be able to read from files, and must cope with the fact that data they were reading from memory may have been spilled to disk underneath them.

The DataBag interface assumes that all data is written before any is read. That is, a DataBag cannot be used as a queue. If data is written after data is read, the results are undefined. This condition is not checked on each add or read, for reasons of speed. Caveat emptor.

Since spills are asynchronous (the memory manager requesting a spill runs in a separate thread), all operations dealing with the mContents Collection (the collection of tuples contained in the bag) have to be synchronized. This means that reading from a DataBag is currently serialized. This is acceptable for the moment because Pig execution is currently single-threaded. A ReadWriteLock was tried, but it was found to be about 10x slower than using the synchronized keyword. If Pig changes its execution model to be multithreaded, we may need to return to this issue, as synchronizing reads will most likely defeat the purpose of multithreaded execution.

DataBags come in three types: default, sorted, and distinct. The type must be chosen up front; there is no way to convert a bag on the fly.
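The spill-and-iterate behaviour described above can be illustrated with a small, self-contained sketch. This is not Pig's implementation: the class name `ToySpillableBag`, its use of `String` items instead of Tuples, the spill file format, and the way the iterator recovers from a spill are all assumptions made for illustration. It only shows the general idea of synchronizing every access to the in-memory collection and letting an iterator continue after memory has been spilled underneath it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical toy bag; holds Strings instead of Pig Tuples. */
public class ToySpillableBag implements Iterable<String> {
    private final List<String> contents = new ArrayList<>(); // in-memory items
    private final List<File> spillFiles = new ArrayList<>(); // every file spilled to so far
    private long size = 0;                                    // items in memory plus on disk

    public synchronized void add(String item) {
        contents.add(item);
        size++;
    }

    public synchronized long size() {
        return size;
    }

    /** What a memory manager (assumed, not shown) would call, possibly from another thread. */
    public synchronized void spill() throws IOException {
        if (contents.isEmpty()) return;
        File f = File.createTempFile("toybag", ".spill");
        f.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(contents.size());          // record count, then the records
            for (String s : contents) out.writeUTF(s);
        }
        spillFiles.add(f);
        contents.clear();
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int nextFile = 0;    // index of the next spill file to open
            private DataInputStream in;  // currently open spill file, if any
            private int leftInFile = 0;  // records not yet read from 'in'
            private int memIdx = 0;      // items already returned from memory
            private long returned = 0;   // items returned overall

            @Override
            public boolean hasNext() {
                synchronized (ToySpillableBag.this) { return returned < size; }
            }

            @Override
            public String next() {
                synchronized (ToySpillableBag.this) {
                    if (returned >= size) throw new NoSuchElementException();
                    try {
                        while (true) {
                            if (in != null && leftInFile > 0) {
                                String s = in.readUTF();
                                leftInFile--;
                                returned++;
                                if (leftInFile == 0) { in.close(); in = null; }
                                return s;
                            }
                            if (in != null) { in.close(); in = null; }
                            if (nextFile < spillFiles.size()) {
                                in = new DataInputStream(new BufferedInputStream(
                                        new FileInputStream(spillFiles.get(nextFile++))));
                                leftInFile = in.readInt();
                                // Memory was spilled underneath us: the first memIdx
                                // records of this file were already returned, so skip them.
                                for (; memIdx > 0; memIdx--, leftInFile--) in.readUTF();
                            } else {
                                returned++;
                                return contents.get(memIdx++);
                            }
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            }
        };
    }
}
```

In this sketch, adding "a" and "b", spilling, adding "c" and "d", reading three items, spilling again, and then draining the iterator yields the items in insertion order. Like the real class, it relies on all writes happening before any read; adding items after iteration has started is outside what the sketch is meant to show.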
Nested Class Summary | |
---|---|
static class | DataBag.BagDelimiterTuple |
static class | DataBag.EndBag |
static class | DataBag.StartBag |
Field Summary | |
---|---|
static Tuple | endBag |
protected static int | MAX_SPILL_FILES |
protected Collection<Tuple> | mContents |
protected long | mMemSize |
protected boolean | mMemSizeChanged |
protected long | mSize |
protected ArrayList<File> | mSpillFiles |
static Tuple | startBag |
Fields inherited from class org.apache.pig.data.Datum |
---|
ATOM, BAG, MAP, OBJECT_SIZE, RECORD_1, RECORD_2, RECORD_3, REF_SIZE, TUPLE |
Constructor Summary | |
---|---|
DataBag() | |
Method Summary | |
---|---|
void | add(Tuple t): Add a tuple to the bag. |
void | addAll(DataBag b): Add contents of a bag to the bag. |
int | cardinality(): Deprecated. |
void | clear(): Clear out the contents of the bag, both on disk and in memory. |
int | compareTo(Object other): This method is potentially very expensive since it may require a sort of the bag; don't call it unless you have to. |
Iterator<Tuple> | content(): Deprecated. |
boolean | equals(Object other) |
protected void | finalize(): Need to override finalize to clean out the mSpillFiles array. |
long | getMemorySize(): Return the size of memory usage. |
protected DataOutputStream | getSpillFile(): Get a file to spill contents to. |
int | hashCode() |
abstract boolean | isDistinct(): Find out if the bag is distinct. |
abstract boolean | isSorted(): Find out if the bag is sorted. |
abstract Iterator<Tuple> | iterator(): Get an iterator to the bag. |
void | markStale(boolean stale): This is used by FuncEvalSpec.FakeDataBag. |
protected void | reportProgress(): Report progress to HDFS. |
long | size(): Get the number of elements in the bag, both in memory and on disk. |
String | toString(): Write the bag into a string. |
void | write(DataOutput out): Write a bag's contents to disk. |
Methods inherited from class java.lang.Object |
---|
clone, getClass, notify, notifyAll, wait, wait, wait |
Methods inherited from interface org.apache.pig.impl.util.Spillable |
---|
spill |
Field Detail |
---|
protected Collection<Tuple> mContents
protected ArrayList<File> mSpillFiles
protected long mSize
protected boolean mMemSizeChanged
protected long mMemSize
public static final Tuple startBag
public static final Tuple endBag
protected static final int MAX_SPILL_FILES
Constructor Detail |
---|
public DataBag()
Method Detail |
---|
public long size()
public int cardinality()
public abstract boolean isSorted()
public abstract boolean isDistinct()
public abstract Iterator<Tuple> iterator()
iterator in interface Iterable<Tuple>
@Deprecated public Iterator<Tuple> content()
public void add(Tuple t)
Parameters: t - tuple to add.
public void addAll(DataBag b)
Parameters: b - bag to add contents of.
public long getMemorySize()
getMemorySize in interface Spillable
getMemorySize in class Datum
public void clear()
public int compareTo(Object other)
compareTo in interface Comparable
public boolean equals(Object other)
equals in class Datum
public void write(DataOutput out) throws IOException
write in class Datum
Parameters: out - DataOutput to write data to.
Throws: IOException - (passes it on from underlying calls).
public void markStale(boolean stale)
Parameters: stale - Set stale state.
public String toString()
toString in class Object
public int hashCode()
hashCode in class Object
protected void finalize()
finalize in class Object
protected DataOutputStream getSpillFile() throws IOException
Throws: IOException
protected void reportProgress()