Class 9

Lists, loops, files, modules, the command line

Objectives for today

  • Use loops to compute statistics for different inputs
  • Describe the motivations for using data structures as the inputs to our functions
  • Open a file for reading and read its contents or iterate through its contents line-by-line
  • Explain the purpose of and use a with block to encapsulate file operations
  • Use optional and keyword arguments
  • Create a correctly formatted and documented module
  • Utilize command line arguments to control program execution
  • Launch Python programs from the command line

To follow along with the lecture, please download all of these files and save them in your cs146 folder:

Writing some functions with lists

Let’s develop some functions to compute statistics about the words in a sentence. To do so we are first going to split the sentence into a list of words. Strings have a method to do exactly that, split.

help(str.split)
Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1) unbound builtins.str method
    Return a list of the substrings in the string, using sep as the separator string.

      sep
        The separator used to split the string.

        When set to None (the default value), will split on any whitespace
        character (including \n \r \t \f and spaces) and will discard
        empty strings from the result.
      maxsplit
        Maximum number of splits.
        -1 (the default value) means no limit.

    Splitting starts at the front of the string and works to the end.

    Note, str.split() is mainly useful for data that has been intentionally
    delimited.  With natural text that includes punctuation, consider using
    the regular expression module.
"this is a sentence".split()
['this', 'is', 'a', 'sentence']
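
If the data is delimited by something other than whitespace, we can pass the separator explicitly:

"1,2,3".split(",")
['1', '2', '3']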

Let’s develop some functions for computing statistics about lists of words, specifically

  • average_word_length(words)
  • longest_word(words)
  • shortest_word(words)

Let’s start with average_word_length. What components will we need in that function? What about in longest_word or shortest_word? For the latter, can we use the built-in min and max functions we saw previously?

We will need a loop to iterate through the list of words. Specifically, we can use a for loop since we know the number of iterations (the length of the list) ahead of time. One of the key challenges here is that we are not comparing the strings directly, e.g., “the” vs. “antidisestablishmentarianism”, but the lengths of those strings. In other words, we will first need to determine the length of each string, then compute the average, etc. If we just use max(words), we will get the lexicographically largest word, not the longest word. As an example, consider:

words = "this is a sentence".split()
max(words)
'this'

Let’s start with this template and assume the list we are processing is non-empty.

def shortest_word(words):
    """
    Find shortest string in collection of strings
    
    Args:
        words: Non-empty collection of strings
    
    Returns:
        shortest string
    """
    
    shortest = words[0]
    for word in words:
        if len(word) < len(shortest):
            shortest = word
      
    return shortest

def longest_word(words):
    """
    Find longest string in collection of strings

    Args:
        words: Non-empty collection of strings
        
    Returns:
        longest string
    """
    longest = words[0]
    
    # To avoid comparing words[0] twice, you could use words[1:]
    for word in words: 
        if len(word) > len(longest):
            longest = word
    
    return longest
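
We still need average_word_length. Here is one possible implementation that follows the same pattern (a sketch; your version may differ):

def average_word_length(words):
    """
    Compute the average length of the strings in a collection

    Args:
        words: Non-empty collection of strings

    Returns:
        Average string length (a float)
    """
    total = 0
    for word in words:
        total += len(word)

    return total / len(words)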

There is another way we can write this using Python features we haven’t discussed yet. This involves passing a function as an argument to the max function (yes, we can pass a function itself as an argument to another function!).

longest = max(words, key=len)

Here key is an optional keyword parameter that expects a function as its argument. That function is applied to each element of the list before performing the comparison, i.e., comparing key(words[0]) to key(words[1]). Effectively it is performing a similar computation to our implementation above. Although we haven’t discussed it yet in class, we can assign functions to variables, pass them as arguments, etc.
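
For example, continuing with the words list from above, we can find both the longest and the shortest word this way (min accepts a key argument as well):

words = "this is a sentence".split()
max(words, key=len)
'sentence'
min(words, key=len)
'a'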

Why did we use lists as the input?

Why did we use lists as the input to our functions? Could we not just have processed the input string one word at a time and not even created the list in the first place, e.g., something like the following:

for word in sentence.split():
    # Update average length numerator/denominator, etc.

Here we are analyzing a single string, but we can imagine other sources of words we might want to analyze, like a file of words. We can use data structures, like a list, to decouple components within a program. In this case, we separate the source of the input from the calculations. We can use any input (strings, files, etc.) that can be read into a list without changing our analysis functions (average length, etc.).

Files

To date, all of our data has been ephemeral, like the sentence examples we just tested. But most data analyses of any scale will start and/or end with data stored in a file. How do we read and write to files?

open("filename", "r")

The first argument is the path to the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path. The second argument is the mode, e.g. 'r' for reading, 'w' for writing, etc.

Let’s open a file of English words for reading (small-file.txt):

file = open("small-file.txt", "r")
file
<_io.TextIOWrapper name='small-file.txt' mode='r' encoding='UTF-8'>

We don’t need to know what a TextIOWrapper is. That is part of Python’s internal implementation. Instead, it’s important for us to know that the file object created by open is iterable, i.e., we can read through all the lines in the file with a for loop (see below), and that it has methods like read for reading the entire contents of the file.

Once we have opened the file, we can read all of the lines with a for loop (note there are other ways to read a file, but we will use for loops most frequently). That is, we use the file as the loop sequence.

for <loop variable> in <file variable>:
    <loop body>

The loop body will get executed for each line of the file, with the loop variable assigned the line (as a string) including any trailing newline (i.e., return or '\n') character. For example:

for line in file:
    print(line)
hello world

how are you

cat

dog

Notice the empty lines; these result from the newlines in the file itself and the newline added by print. That is, the contents of the file are really equivalent to:

"hello world\nhow are you\ncat\ndog\n"

When Python reads each line of file, it includes the newline, i.e., the first value for line is "hello world\n". By default print adds its own newline, so the result is to print hello world\n\n (note the two newline characters). We typically don’t want the newline from the file, so we often use string’s strip method to remove it from each line after reading it from the file.

help(str.strip)
Help on method_descriptor:

strip(self, chars=None, /) unbound builtins.str method
    Return a copy of the string with leading and trailing whitespace removed.

    If chars is given and not None, remove characters in chars instead.
a = "string with newline\n"
a.strip()
'string with newline'

If we try to run our loop again (with strip added), e.g.

for line in file:
    print(line.strip())

nothing will be printed. This is not unexpected. The file object maintains state, specifically a pointer to how much of the file has been read. When we first open the file, the pointer “points to” the beginning of the file. Once we have read the file it points to the end. Thus, there is nothing more to read. There are methods that we can use to reset the pointer, or we can close and then reopen the file.
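
For example, the file object’s seek method moves that pointer; calling file.seek(0) moves it back to the beginning so the file can be read again (a quick illustration, assuming the file is still open):

file.seek(0)   # move the read position back to the start of the file
for line in file:
    print(line.strip())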

All open files need to be closed (with the close method, e.g. file.close()). This is especially important when writing to files, as it forces the data to actually be written to the disk/file system. You can do so manually, but it is easy to forget, and there are error situations where you may not be able to explicitly call close. Best practice is to use a with block, which ensures that the file is always (automatically) closed for you. For example:

with open("filename", "r") as file:
    # Work with file object
    # File is automatically closed when you exit the with block

In our class, the expectation is that you will always use with blocks when reading files.

Let’s put this together with the functions that we wrote earlier to generate basic statistics about pokemon.txt, which was generated from here (Pokémon with spaces in their names were removed).
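
A sketch of what file_stats might look like, reusing the functions we wrote earlier (it assumes the file contains one word per line; your exact output formatting may differ):

def file_stats(filename):
    """
    Print statistics about the words in a file (one word per line)

    Args:
        filename: Name of file to analyze
    """
    words = []
    with open(filename, "r") as file:
        for line in file:
            words.append(line.strip())

    # Reuse longest_word, shortest_word, and average_word_length from above
    print("Number of words:", len(words))
    print("Longest word:", longest_word(words))
    print("Shortest word:", shortest_word(words))
    print("Avg. word length:", average_word_length(words))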

>>> file_stats("pokemon.txt")
Number of words: 997
Longest word: Crabominable
Shortest word:  Muk
Avg. word length:  7.592778335005015

Notice that almost all the code is shared between sentence_stats and file_stats. This is nicely DRY! If you ever find yourself copying and pasting code, make a function instead.

Note that these examples assume that the file you are reading and the program are in the same directory.

Optional Parameters

We have used range extensively when writing for-loops. Sometimes we include a start, and sometimes we include a step:

>>> help(range)
Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object

This works because Python supports optional arguments, e.g. the optional “step”. How would we implement our own version of range? Let’s fill in (optional_parameters.py) with our own implementation:

def my_range(start, stop, step):
    """
    Return a range
    
    Args:
        start: inclusive start index
        stop: exclusive stop index
        step: range increment

    Returns: A list of integers
    """
    i = start
    r = []
    
    while i < stop:
        r.append(i)
        i += step
    
    return r

What if we want a default step of 1 (similar to the built-in range)? Maybe we can write a new function?

def my_range_with_unitstep(start, stop):
    return my_range(start, stop, 1)

This is DRY-ish, but we can condense these two functions into one by setting a default value for step. We can do so by providing a value in the function header as shown below. We describe those parameters with default values as “optional”, i.e., we no longer have to provide a value for that parameter when we call the function. If we don’t provide a value, Python uses the default value specified in the header.

def my_range(start, stop, step=1):

Now we can use the same function for the two different use cases. More generally, optional parameters are useful when there is a sensible default value (i.e. stepping by one) but the caller might want/need to change that value sometimes.

Note that you can also specify parameters by name, which is helpful if there are many optional parameters and you only want to change one or two.

>>> my_range(0, 5, step=2)
[0, 2, 4]
>>> my_range(start=1, stop=5)
[1, 2, 3, 4]
>>> my_range(5, start=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: my_range() got multiple values for argument 'start'
>>> my_range(start=0, 5)
  File "<stdin>", line 1
SyntaxError: positional argument follows keyword argument

All Python function parameters can actually be specified by name (if we wanted to do so). There are some limits, however: keyword arguments must follow positional arguments and you can’t specify the same argument more than once. Our general practice is to provide required arguments (those without default values) by position (without the name) and optional arguments by name.

>>> help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.

A common place to use keyword arguments is with print, where you will likely only want to modify one of the many optional arguments, e.g. the separator (sep) or the end, but not all.
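
For example, changing just the separator or the line ending:

>>> print("pokemon", "txt", sep=".")
pokemon.txt
>>> print("Total lines:", 1074, end="!\n")
Total lines: 1074!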

Modules

What is a module? A module is a collection of related functions and variables. Why do we have modules? To organize and distribute code in a way that minimizes naming conflicts. We have already been using modules (e.g. math and turtle), and you have actually all written modules! Every .py file is a module.

Let’s consider the linked my_module as an example. my_module includes a constant and several functions. After importing my_module we can use those functions like any of those in math or the other modules we have used.

import my_module
my_module.a()
my_module.b(10, 15)
my_module.c("this is a test")
my_module.SOME_CONSTANT
Loaded my_module
Importing the module my_module
10
25
'tt'
10

What about help?

help(my_module)
Help on module my_module:

NAME
    my_module - Some basic functions to illustrate how modules work

DESCRIPTION
    A more detailed description of the module.

FUNCTIONS
    a()
        Prints out the number 10

    b(x, y)
        Returns x plus y

    c(some_string)
        Returns the first and last character of some_string

DATA
    SOME_CONSTANT = 10

FILE
    /home/runner/work/csci146f25/csci146f25/site/classes/my_module.py

That multi-line string (in triple quotes) at the top of the file is the module’s docstring:

  • The NAME and brief description come from the filename and that docstring
  • DESCRIPTION is any subsequent lines in that docstring
  • FUNCTIONS enumerates the functions and the description from their docstrings
  • DATA enumerates any constants
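
Based on that help output, the contents of my_module.py look roughly like the following sketch (the actual linked file also contains a couple of print statements that we will get to shortly):

"""
Some basic functions to illustrate how modules work

A more detailed description of the module.
"""

SOME_CONSTANT = 10

def a():
    """Prints out the number 10"""
    print(10)

def b(x, y):
    """Returns x plus y"""
    return x + y

def c(some_string):
    """Returns the first and last character of some_string"""
    return some_string[0] + some_string[-1]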

Importing

What happens when we import a module? Python executes the Python file.

Where did the __pycache__ folder come from? When we import a module, Python compiles it to bytecode stored in a “.pyc” file. This lower-level representation is more efficient to execute. These files aren’t important for this class, but I want you to be aware of where those files are coming from…

So adding a print statement, e.g. print("Loaded my_module"), to our Python module file will print that message on import.

>>> import my_module
>>> 

Why didn’t it print? Python doesn’t re-import modules that are already imported. Why does that behavior make sense? What if multiple modules import that same module, e.g. math? What if two modules import each other?

As a practical matter, that means that if we change our module we will need to restart the Python console (click on the trash can icon next to the Terminal name and then open the shell again) or use the explicit reload function:

>>> import importlib
>>> importlib.reload(my_module)
Loaded my_module
<module 'my_module' from 'my_module.py'>

Run vs. Import

When we click the button, we are “running” our Python programs. We could also have been importing them. When would you want to do one or the other?

Think about our Cryptography assignment. We could imagine using our functions in a program that helps people securely communicate with each other, or imagine that other programmers might want to use our functions in their own communication systems. For the former, we would want to encrypt/decrypt when our module is run. For the latter, we would want to make our code “importable” without actually invoking any of the functions.

Python has a special variable __name__ that can be used to determine whether our module is being run or being imported. When the file is run, Python automatically sets that variable to be "__main__". If a file is imported, Python sets that variable to be the filename as a string (without the “.py” extension).

We typically use this variable in a conditional at the end of the file that changes the behavior depending on the context. For example:

if __name__ == "__main__":
    print("Running the module")
else:
    print("Importing the module")

In most cases, you will only have the “if” branch (you will only be doing something if the program is run).

For example, when prompting users for input in Programming Assignment 4, we would do so only if the program is being run (not imported). Gradescope imports your files so that it can test functions without necessarily simulating all of the user interactions.
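
For example, a (hypothetical) assignment file might be structured like the sketch below: the functions can be imported and tested directly, while the input prompts only run when the file is executed as a program.

def double(value):
    """Example function that an autograder could import and test directly"""
    return 2 * value

if __name__ == "__main__":
    # Only prompt the user when this file is run directly, not when imported
    number = int(input("Enter a number: "))
    print("Doubled:", double(number))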

Command line (and the Terminal)

In many programming assignments and examples we have either “hardcoded” the input, e.g., the “pokemon.txt” filename above, or we have solicited input from the user (via the input function) to control the execution of the program. As you test your programs, typing those inputs each time probably gets a little tedious. There must be a different way…

What does the button actually do?

The button actually invokes the Python program you installed on your computer at the beginning of the semester. Let’s unpack what it’s doing. First, either click Python Shell or the button and then type:

>>> exit()

You’re now in a Terminal or Command Prompt that you can use to (more generally) interact with your system. On Mac, you’ll see a $ which is the Terminal’s equivalent of Python’s >>> prompt. On Windows, you’ll see >. On Mac, you can get back to Python by typing (without the $)

$ python3

On Windows, you would type (without the >):

> python

Well, that just brought us back to where we started. Instead, to run a program stored in a file, say my_program.py, we can pass the filename to the Python program:

$ python3 my_program.py

or

> python my_program.py

Again, you should not type $ or >.

Why is this useful? For one, it gives us a way to run our programs from the command line. More importantly, it allows us to pass command line arguments to our programs. Any additional arguments after the script name become the command line arguments to the script.

What are command line arguments? Like function arguments/parameters, command line arguments are values passed to a Python program that affect its execution. We use function parameters to change the inputs for our function; command line arguments do the same for a program. Instead of using the input function to solicit input from the user to control the execution of our programs, we can specify those “inputs” on the command line as command line arguments. Doing so makes it easier to control our programs in an automated way.

The why of the command line is a much larger question that we won’t fully experience in this class. From my own personal experience, being able to efficiently use a command line environment (and write programs to be used in that environment) will make you much more productive and effective at data analysis and other computational tasks.

For example, are you curious how many lines of code you’ve written so far this semester? Here is a function to count the non-empty lines in a file. But how can we run this on every file in your cs146 folder?

def count_lines(filename):
    """
    Count non-empty lines in file

    Args:
        filename: File to examine
    
    Return: Count of non-empty lines
    """
    with open(filename, "r") as file:
        count = 0
        for line in file:
            if line.strip() != "":
                count +=1
        return count

We could manually make a list of all the files, but that is slow and error-prone. Instead, we would like to solve this problem programmatically. The command line can help us do so: it provides a mechanism for programmatically interacting with your computer, e.g., accessing directories, files, other programs, and more. Let’s learn how to make that work.

$ python3 line_counter.py *.py
Total lines: 1074

A Command Line Example

We will use sys_args.py as our working example. First run this program with the button.

Arguments: ['sys_args.py']
0: sys_args.py

This program prints the arguments that are passed to it.

The Python module sys (short for “system”) provides a variable, argv, that is set to a list of the command line arguments. The first element of this list is always the path of the program that is executing.
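
sys_args.py probably looks something like the following sketch (consistent with the output above):

import sys

# Print the full argument list, then each argument with its index
print("Arguments:", sys.argv)
for i in range(len(sys.argv)):
    print(i, sys.argv[i], sep=": ")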

$ python3 sys_args.py these are some arguments in 2025
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments', 'in', '2025']
0: sys_args.py
1: these
2: are
3: some
4: arguments
5: in
6: 2025

Returning to our line counter

In our earlier example usage, we used *.py as a wildcard (this expansion is called globbing) that the terminal expands into all files that end in “.py”, i.e., that command was equivalent to

$ python3 line_counter.py my_module.py sys_args.py ...

We can write our Python code to process any number of files provided on the command line. Here we use a for loop to iterate through all the files provided on the command line, and thus in the sys.argv list (recall that the first element, at index 0, is always the name of the program that is executing). With that small amount of code, we now have a very useful (and efficient) tool. Add the following at the end of your line_counter.py file (and make sure import sys appears at the top of the file):

if __name__ == "__main__":
    # Check that at least one file is provided on the command line
    if len(sys.argv) == 1:
        print("Usage: python line_counter.py <1 or more files>")
    else:
        count = 0
        # Process all of the command line arguments (after the name of the program that is always at index 0)
        for filename in sys.argv[1:]:
            count += count_lines(filename)
        print("Total lines:", count)

Could we have accomplished the same task purely within Python, without using the command line environment? Yes, although the resulting approach would be less flexible. For example, we could use the listdir function in the os module to get a list of all the files in the current directory and then filter that list for just those files with names ending in “.py”:

import os
filenames = os.listdir()

count = 0
for filename in filenames:
    if filename.endswith(".py"):
        count += count_lines(filename)
print("Total lines:", count)

While this code may seem simpler than the approach above, it has several assumptions built in that make it less flexible. For example, we are only interested in files in the current directory and only files ending in “.py”. If we want to look at files in a different directory or with different/multiple file endings, we will need to modify our program. In contrast, our approach using the command line works for all those scenarios without any modification. For example,

$ python line_counter.py *.py *.txt

counts the lines in both the Python files and the text files by expanding both wildcards. In this respect, the command line environment “augments” the capabilities of our Python programs, which makes them more flexible.