org.apache.pig.data
Class DistinctDataBag

java.lang.Object
  extended by org.apache.pig.data.Datum
      extended by org.apache.pig.data.DataBag
          extended by org.apache.pig.data.DistinctDataBag
All Implemented Interfaces:
Comparable, Iterable<Tuple>, Spillable

public class DistinctDataBag
extends DataBag

An unordered collection of Tuples with no duplicates. Duplicates are eliminated as data is added. When it is time to spill, the data is sorted and written to disk. The data must also be sorted on the first read; otherwise, if a spill happened after that read, the iterators would have no way to find their place in the new file. The data is stored in a HashSet. When it is time to sort, the data is copied into an ArrayList and then sorted. Despite all these machinations, this was found to be faster than storing the data in a TreeSet.
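
The store-then-sort approach described above can be illustrated with a minimal sketch. This is not the Pig implementation: the class name DistinctSortSketch and the method sortedSnapshot are hypothetical, and plain String values stand in for Tuples. It only shows the general pattern of deduplicating in a HashSet on insert and copying into an ArrayList when a sorted view is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of org.apache.pig.data.
final class DistinctSortSketch {

    // Copy the deduplicated contents into a list and sort the copy.
    // The set itself is left untouched.
    static <T extends Comparable<T>> List<T> sortedSnapshot(Set<T> contents) {
        List<T> sorted = new ArrayList<>(contents);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Strings stand in for Tuples (an assumption made for brevity).
        Set<String> contents = new HashSet<>();
        for (String s : Arrays.asList("pear", "apple", "pear", "fig", "apple")) {
            contents.add(s); // a repeated value is silently ignored by the set
        }
        System.out.println(contents.size());          // prints 3
        System.out.println(sortedSnapshot(contents)); // prints [apple, fig, pear]
    }
}
```

Inserting into a HashSet takes expected constant time, and the sorting cost is paid only when a sorted view is actually required. A TreeSet pays an ordering cost on every insert, which may be part of why the approach above was measured to be faster.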


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.pig.data.DataBag
DataBag.BagDelimiterTuple, DataBag.EndBag, DataBag.StartBag
 
Field Summary
 
Fields inherited from class org.apache.pig.data.DataBag
endBag, MAX_SPILL_FILES, mContents, mMemSize, mMemSizeChanged, mSize, mSpillFiles, startBag
 
Fields inherited from class org.apache.pig.data.Datum
ATOM, BAG, MAP, OBJECT_SIZE, RECORD_1, RECORD_2, RECORD_3, REF_SIZE, TUPLE
 
Constructor Summary
DistinctDataBag()
           
 
Method Summary
 void add(Tuple t)
          Add a tuple to the bag.
 void addAll(DataBag b)
          Add contents of a bag to the bag.
 boolean isDistinct()
          Find out if the bag is distinct.
 boolean isSorted()
          Find out if the bag is sorted.
 Iterator<Tuple> iterator()
          Get an iterator to the bag.
 long size()
          Get the number of elements in the bag, both in memory and on disk.
 long spill()
          Instructs an object to spill whatever it can to disk and release references to any data structures it spills.
 
Methods inherited from class org.apache.pig.data.DataBag
cardinality, clear, compareTo, content, equals, finalize, getMemorySize, getSpillFile, hashCode, markStale, reportProgress, toString, write
 
Methods inherited from class java.lang.Object
clone, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DistinctDataBag

public DistinctDataBag()
Method Detail

isSorted

public boolean isSorted()
Description copied from class: DataBag
Find out if the bag is sorted.

Specified by:
isSorted in class DataBag

isDistinct

public boolean isDistinct()
Description copied from class: DataBag
Find out if the bag is distinct.

Specified by:
isDistinct in class DataBag

size

public long size()
Description copied from class: DataBag
Get the number of elements in the bag, both in memory and on disk.

Overrides:
size in class DataBag

iterator

public Iterator<Tuple> iterator()
Description copied from class: DataBag
Get an iterator to the bag. For default and distinct bags, no particular order is guaranteed. For sorted bags the order is guaranteed to be sorted according to the provided comparator.

Specified by:
iterator in interface Iterable<Tuple>
Specified by:
iterator in class DataBag

add

public void add(Tuple t)
Description copied from class: DataBag
Add a tuple to the bag.

Overrides:
add in class DataBag
Parameters:
t - tuple to add.

addAll

public void addAll(DataBag b)
Description copied from class: DataBag
Add contents of a bag to the bag.

Overrides:
addAll in class DataBag
Parameters:
b - bag to add contents of.

spill

public long spill()
Description copied from interface: Spillable
Instructs an object to spill whatever it can to disk and release references to any data structures it spills.
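
As a rough illustration of what a spill of a distinct collection might involve, the sketch below sorts the in-memory contents, writes them to a temporary file, and releases the in-memory references. Everything in it is an assumption made for illustration: the class SpillSketch, its methods, the one-line-per-item text format, and the choice to return the number of items written. The real file format and return value of spill() are not stated on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Pig implementation.
final class SpillSketch {
    private final Set<String> contents = new HashSet<>();    // in-memory, deduplicated
    private final List<Path> spillFiles = new ArrayList<>(); // one file per spill

    void add(String item) {
        contents.add(item);
    }

    int inMemoryCount() {
        return contents.size();
    }

    // Sort the in-memory data, write it to a new temporary file, and drop
    // the in-memory references. Returning the item count is an assumption.
    long spill() {
        if (contents.isEmpty()) {
            return 0;
        }
        List<String> sorted = new ArrayList<>(contents);
        Collections.sort(sorted);
        try {
            Path file = Files.createTempFile("distinct-bag-sketch", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, sorted, StandardCharsets.UTF_8);
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        contents.clear(); // release references to the spilled data
        return sorted.size();
    }

    // Read back the lines of the spill file at the given index.
    List<String> spilledLines(int index) {
        try {
            return Files.readAllLines(spillFiles.get(index), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SpillSketch bag = new SpillSketch();
        bag.add("b");
        bag.add("a");
        bag.add("b");
        System.out.println(bag.spill());         // prints 2
        System.out.println(bag.inMemoryCount()); // prints 0
        System.out.println(bag.spilledLines(0)); // prints [a, b]
    }
}
```

Because each spill file in this sketch is written in sorted order, a reader could later merge several such files and skip equal neighbours to keep the combined view free of duplicates.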



Copyright © ${year} The Apache Software Foundation