Programming Assignment 8: Bucket Set

Initial Due Date: 2025-11-14 8:00AM
Final Due Date: 2025-12-04 4:15PM

Goals

Practice some more with defining a class, writing methods and using instance variables.
Run an experiment to assess the runtime performance of a method.

Background

In class, we saw that the in operator for lists has a linear complexity \(O(n)\). We had also previously made the comparison between list and set for the in operator (see lists_vs_sets.py). The runtime complexity of in for set is actually constant \(O(1)\).

In this assignment, you’ll implement your own version of a set called BucketSet and then run some experiments to evaluate the performance and compare it with the built-in set and list data structures. Download the starter file pa8_bucketset.py and save it to your cs146 folder.

Main idea: A BucketSet will maintain a list of “buckets” to save items, where each bucket is a list. Whenever an item is added to a BucketSet, we need to figure out which bucket to add the item to and then append it to this bucket. Similarly, figuring out if an item is in a BucketSet consists of determining the bucket index and then checking if the item is in that bucket. If we limit the number of items that can be added to any individual bucket, then in on the bucket will be pretty fast. But it means that we need to make more buckets whenever we add an item to a BucketSet that breaches this maximum size for a single bucket.

We will start off with 8 buckets, and double the number of buckets whenever any individual bucket has more than (\(>\)) 10 items. This means the number of buckets will always be a multiple of 8.

Part 1

Your main task in this part is to implement (1) the add method (Part 1A) and (2) the __contains__ method which is what Python looks for when you use in (Part 1B).

First, note the two instance variables for a BucketSet: buckets and num_entries. The buckets are initially a list-of-lists of size 8 (and each bucket is initially empty). The num_entries variable keeps track of how many items are in the set, which is returned when we call len on an instance of a BucketSet (notice how __len__ is defined).

Part 1A: The add method should work like Python’s built-in set add method. For a BucketSet, you first need to figure out which bucket to add the item to, which can be done by calling the provided which_bucket(buckets, item) function. This function returns the index of the bucket item should be added to (in buckets). A function is also provided to allocate the buckets. Here is an example:

>>> buckets = allocate_buckets(8) # this allocates a list-of-lists, here a list of 8 empty lists
>>> which_bucket(buckets, 13854)
6
>>> which_bucket(buckets, 'z')
3

Notice that the which_bucket function uses the modulus operator % so that the index of the bucket is always in the appropriate bounds. The hash function returns a unique integer of an immutable object so that we can then take the modulus with the number of buckets. This means your BucketSet will work with several types (not just integers).

In the example above, 13854 would be added to the seventh bucket (index 6) and z would be added to the fourth bucket (index 3).

Complete the add method (remove the pass statement) to add an item to the BucketSet. Here are some other things your add method should consider:

Remember that a set has unique values. If the item is already in the bucket, don’t add it again.
If the bucket you add the item to has more than 10 items (after the addition), then we are going to need to redistribute ALL the items in the BucketSet by (1) doubling the number of buckets and then (2) re-adding all the current items into the new buckets.

Here is some pseudocode to help with the add method:

Determine which index an item should be added to. Call this bucket \(i\).
If item is already in bucket \(i\), don’t add it. Otherwise, add item to bucket \(i\) and increment self.num_entries.
Check if bucket \(i\) now has more than 10 items (after the addition in Step 2). If so, use allocate_buckets to create a new list of buckets (new_buckets) with double the number of buckets.
Insert all the items currently stored in self.buckets into new_buckets. The bucket indices for the items might change because you should now be passing new_buckets to which_bucket.
Assign self.buckets = new_buckets after adding all the items.

For example, imagine you add the items 0, 8, 16, 24, 32, 40, 48, 56, 64, 72 to a BucketSet with 8 buckets (initially). These would all be added to the same (first) bucket since they are all multiples of 8. All the other buckets are empty. Here are what the buckets would look like:

[[0, 8, 16, 24, 32, 40, 48, 56, 64, 72], [], [], [], [], [], [], []]

However, say we now add 80 to the BucketSet. This would also add to the first bucket, but because it now has 11 items, we need to double the number of buckets to 16, so the buckets would now look like:

[[0, 16, 32, 48, 64, 80], [], [], [], [], [], [], [], [8, 24, 40, 56, 72], [], [], [], [], [], [], []]

Note that some items get redistributed to another bucket since the number of buckets changed (and hence the result of % used in the which_bucket function changed).

Part 1B: Complete the __contains__ method which determines if an item is in a BucketSet. Again, you’ll need to determine which bucket that item should be in using the which_bucket function, and then use in on the particular list which represents the bucket. Try it out:

>>> s = BucketSet()
>>> s.add(1)
>>> s.add(2)
>>> s.add(3)
>>> 1 in s
True
>>> 4 in s
False

Part 2

Expand the Experimentation Code section below and copy it to your pa8_bucketset.py file. Then call experiment(1000, 10000, 100000, 5000) which will compare the performance of querying your BucketSet (via in) with Python’s built-in set and list. Copy the output printed in the shell into the docstring at the top of your submission and add a sentence describing what you notice. Specifically, how does your BucketSet compare with Python’s list? How does it compare with Python’s set? What does the experiment suggest the big-O complexity is for your BucketSet __contains__ (in) operator?

The Gradescope tests will specifically look for the header of the printed table, so be sure to include it.

Note that the experiment function uses something called an f-string to format the timing results. We haven’t talked about this yet, but it is a much more convenient way to format strings. For example, if you have variables x, y and z you want to print, you can prefix the string with f and then enclose the variables you want to print in curly-braces (within the string):

>>> print(f"{x}, {y}, {z}")

Experimentation Code

Note that the time and random modules are already imported, which are needed for the code below.

def query_data(data, num_queries, largest):
    """
    Compute time to execute random membership queries.

    Args:
        data: List or Set
        num_queries: Number of membership queries to perform
        largest: Upper bound for random integers to query

    Returns:
        Elapsed time for all queries
    """
    count = 0 # we won't do anything with the count
    begin = time.time()
    for _ in range(num_queries):
        random_num = random.randint(0, largest)
        if random_num in data:
            count += 1
    return time.time() - begin

def experiment(num_queries, size_min, size_max, step_size):
    """
    Print a table of runtimes for querying lists and sets of different sizes.

    Args:
        num_queries: Number of membership queries to perform
        size_min: Minimum container size
        size_max: Maximum container size
        step_size: Container step size
    """
    print(f"| {"size":6} | {"bucketset":12} | {"set":12} | {"list":12} |")
    print("-" * 55)
    for size in range(size_min, size_max + 1, step_size):   

        max_value = 10 * size

        # Generate a set with random data, and create a bucket set
        # and list with the same data
        set_data = set()
        while len(set_data) < size:
           set_data.add(random.randint(0, max_value))
        bucketset_data = BucketSet()
        for item in set_data:
            bucketset_data.add(item)
        list_data = list(set_data)
    
        # Time the set and list querying
        bucketset_elapsed = query_data(bucketset_data, num_queries, max_value)
        set_elapsed = query_data(set_data, num_queries, max_value)
        list_elapsed = query_data(list_data, num_queries, max_value)
        
        # Report the results for this size
        # Note this uses something called an f-string to format the timing results
        print(f"| {size:6} | {bucketset_elapsed:1.10f} | {set_elapsed:1.10f} | {list_elapsed:1.10f} |")

Creativity Points

Here are some possible creativity additions, although you are encouraged to include your own ideas. Make sure to document your additions in the docstring comment at the top of the file.

[1 point] Extend the __init__ method of BucketSet to accept an optional sequence parameter to initialize the contents of the bucket set. For example,

>>> s = BucketSet([1, 2, 3])

will create a BucketSet containing the items 1, 2 and 3.

[1 point for each overloaded operator] Add __or__ to perform union (via |), __sub__ to perform the difference (via -), __and__ to perform the intersection (via &), __xor__ to perform the symmetric difference between two BucketSets sets.
[1 point] Add a __str__ to print out a BucketSet. The entries should appear in the same way that the entries in a set appear. For example:

>>> s = BucketSet()
>>> s.add(1)
>>> s.add(2)
>>> s.add(3)
>>> print(s)
{1, 2, 3}

When you’re done

Make sure that your program is properly documented:

You should have a docstring at the very beginning of the file briefly describing your program and stating your name, section and creativity additions.
Each function should have an appropriate docstring (including arguments and return value if applicable).
Other miscellaneous inline/block comments if the code might otherwise be unclear.

In addition, make sure that you’ve used good code design and style (including helper functions where useful, meaningful variable names, constants where relevant, vertical white space, removing “dead code” that doesn’t do anything, removing testing code, etc.).

Submit your programs via Gradescope. Your file must be named pa8_bucketset.py. You can submit multiple times, with only the most recent submission (before the due date) graded.

Grading

Assessment	Requirements
Revision needed	Some but not all tests are passing.
Meets Expectations	All tests pass, the required functions are implemented correctly and your implementation uses satisfactory style.
Exemplary	All requirements for Meets Expectations, 2 creativity points, and your implementation is clear, concise, readily understood, and maintainable.