Let’s develop some functions to compute statistics about the words in a sentence. To do so we are first going to split the sentence into a list of words. Strings have a method to do exactly that, split.
help(str.split)
Help on method_descriptor:
split(self, /, sep=None, maxsplit=-1) unbound builtins.str method
Return a list of the substrings in the string, using sep as the separator string.
sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace
character (including \n \r \t \f and spaces) and will discard
empty strings from the result.
maxsplit
Maximum number of splits.
-1 (the default value) means no limit.
Splitting starts at the front of the string and works to the end.
Note, str.split() is mainly useful for data that has been intentionally
delimited. With natural text that includes punctuation, consider using
the regular expression module.
"this is a sentence".split()
['this', 'is', 'a', 'sentence']
Let’s develop some functions for computing statistics about lists of words, specifically
average_word_length(words)
longest_word(words)
shortest_word(words)
Let’s start with average_word_length. What components will we need in that function? What about in longest_word or shortest_word? For the latter, can we use the built-in min and max functions we saw previously?
Tip: Thinking through these functions
We will need a loop to iterate through the list of words. Specifically, we can use a for loop since we know the number of iterations (the length of the list) ahead of time. One of the key challenges here is that we are not comparing the strings directly, e.g., “the” vs. “antidisestablishmentarianism”, but the lengths of those strings. In other words, we will first need to determine the lengths of each string, then compute the average, etc. If we just use max(words), we will get the lexicographically largest word not the longest word. As an example, consider:
words = "this is a sentence".split()
max(words)
'this'
Let’s start with this template and assume the list we are processing is non-empty.
Tip: Possible implementation of shortest_word and longest_word
def shortest_word(words):
    """
    Find shortest string in collection of strings

    Args:
        words: Non-empty collection of strings

    Returns:
        shortest string
    """
    shortest = words[0]
    for word in words:
        if len(word) < len(shortest):
            shortest = word
    return shortest

def longest_word(words):
    """
    Find longest string in collection of strings

    Args:
        words: Non-empty collection of strings

    Returns:
        longest string
    """
    longest = words[0]
    # To avoid comparing words[0] twice, you could use words[1:]
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
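The same for-loop pattern also covers average_word_length, which the implementations above leave out. What follows is a minimal sketch, not necessarily the version developed in class. Like the functions above, it assumes a non-empty list (an empty list would cause a division by zero).

```python
def average_word_length(words):
    """
    Compute average length of the strings in a collection

    Args:
        words: Non-empty collection of strings

    Returns:
        Average string length as a float
    """
    total = 0
    for word in words:
        # Accumulate the lengths, not the strings themselves
        total += len(word)
    return total / len(words)


words = "this is a sentence".split()
print(average_word_length(words))  # 15 letters / 4 words = 3.75
```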
Note: Alternate implementations using Python features we haven’t discussed
There is another way we can write this using Python features we haven’t discussed yet. This involves passing a function as an argument to the max function (yes, we can pass a function itself as an argument to another function!).
longest = max(words, key=len)
Here key is an optional keyword parameter that expects a function as its argument. That function is applied to each element of the list before performing the comparison, i.e., comparing key(words[0]) to key(words[1]). Effectively it is performing a similar computation to our implementation above. Although we haven’t discussed it yet in class, we can assign functions to variables, pass them as arguments, etc.
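As a quick check that key=len behaves like our loop-based functions, the sketch below applies it with both max and min to the example sentence. The variable names are only for illustration.

```python
words = "this is a sentence".split()

# Compare the words by their lengths rather than lexicographically
longest = max(words, key=len)
shortest = min(words, key=len)

print(longest)   # sentence
print(shortest)  # a
```

When several words tie for the longest (or shortest) length, max and min return the first such word, which matches the behavior of the loops above.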
Why did we use lists as the input?
Why did we use lists as the input to our functions? Could we not just have processed the input string one word at a time and not even created the list in the first place, e.g., something like the following:
for word in sentence.split():
    # Update average length numerator/denominator, etc.
Here we are analyzing a single string, but we can imagine other sources of words we might want to analyze, like a file of words. We can use data structures, like a list, to decouple components within a program. In this case, we separate the source of the input from the calculations. We can use any input (strings, files, etc.) that can be read into a list without changing our analysis functions (the average length, etc.).
Files
To date, all of our data has been ephemeral, like the sentence examples we just tested. But most data analyses of any scale will start and/or end with data stored in a file. How do we read and write to files?
open("filename", "r")
The first argument is the path to the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path. The second argument is the mode, e.g. 'r' for reading, 'w' for writing, etc.
Let’s open a file of English words for reading (small-file.txt):
We don’t need to know what a TextIOWrapper is. That is part of Python’s internal implementation. Instead, it’s important for us to know that the file object created by open is iterable, i.e., we can read through all the lines in the file with a for loop (see below), and that it has methods like read for reading the entire contents of the file.
Once we have opened the file we can read all of the lines with a for loop (note there are other ways to read a file, but we will use for loops most frequently). That is we use the file as the loop sequence.
for <loop variable> in <file variable>:
    <loop body>
The loop body will get executed for each line of the file with the loop variable assigned the line (as a string) including any newline (i.e. return or '\n' characters). For example:
for line in file:
    print(line)
hello world

how are you

cat

dog
Notice the empty lines. These result from the newlines in the file itself and the newline added by print. That is, the contents of the file are really equivalent to:
"hello world\nhow are you\ncat\ndog\n"
When Python reads each line of the file, it includes the newline, i.e., the first value for line is "hello world\n". By default print adds its own newline, so the result is to print hello world\n\n (note the two newline characters). We typically don’t want the newline from the file, so we often use string’s strip method to remove it from each line after reading it from the file.
help(str.strip)
Help on method_descriptor:
strip(self, chars=None, /) unbound builtins.str method
Return a copy of the string with leading and trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
a = "string with newline\n"
a.strip()
'string with newline'
If we try to run our loop again (with strip added), e.g.
for line in file:
    print(line.strip())
nothing will be printed. This is not unexpected. The file object maintains state, specifically a pointer to how much of the file has been read. When we first open the file, the pointer “points to” the beginning of the file. Once we have read the file it points to the end. Thus, there is nothing more to read. There are methods that we can use to reset the pointer, or we can close and then reopen the file.
All open files need to be closed (with the close method, e.g. file.close()). This is especially important for writing to files as it will force the data to actually be written to the disk/file system. You can do so manually, but it is easy to forget, and there are error situations where you may not be able to explicitly call close. Best practices are to use with blocks, which ensure that the file always (automatically) closes for you. For example:
with open("filename", "r") as file:
    # Work with file object
# File is automatically closed when you exit the with block
In our class, the expectation is that you will always use with blocks when reading files.
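To tie these pieces together, here is a small sketch that combines a with block, strip, and the file pointer behavior described above. It first writes a throwaway file so that the example can be run on its own; the name demo-file.txt and the helper read_lines are made up for this illustration. The seek(0) call is one of the methods that moves the pointer back to the beginning of the file.

```python
import os
import tempfile


def read_lines(file):
    """Return the remaining lines of an open file, without their newlines"""
    lines = []
    for line in file:
        lines.append(line.strip())
    return lines


# Create a small file to read (hypothetical name, in a temporary directory)
path = os.path.join(tempfile.mkdtemp(), "demo-file.txt")
with open(path, "w") as file:
    file.write("hello world\nhow are you\n")

with open(path, "r") as file:
    first = read_lines(file)   # reads every line
    second = read_lines(file)  # pointer is at the end, nothing left to read
    file.seek(0)               # move the pointer back to the beginning
    third = read_lines(file)   # reads every line again

print(first)   # ['hello world', 'how are you']
print(second)  # []
print(third)   # ['hello world', 'how are you']

# The file was closed automatically when the with block ended
print(file.closed)  # True
```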
Let’s put this together with the functions that we wrote earlier to generate basic statistics about pokemon.txt, which was generated from here (Pokémon with spaces in their names were removed).
>>> file_stats("pokemon.txt")
Number of words: 997
Longest word: Crabominable
Shortest word: Muk
Avg. word length: 7.592778335005015
Notice that almost all the code is shared between sentence_stats and file_stats. This is nicely DRY! If you ever find yourself copying and pasting code, make a function instead.
Note that these examples assume that the file you are reading and the program are in the same directory.
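The code for sentence_stats and file_stats is not reproduced in these notes, so the following is only a sketch of how the two might share their work. The helper names read_words and print_stats are assumptions made for this illustration, the word functions are written compactly with key=len rather than with loops, and the file is assumed to contain one word per line with no blank lines.

```python
def longest_word(words):
    return max(words, key=len)


def shortest_word(words):
    return min(words, key=len)


def average_word_length(words):
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def read_words(filename):
    """Read a file with one word per line into a list of strings"""
    words = []
    with open(filename, "r") as file:
        for line in file:
            words.append(line.strip())
    return words


def print_stats(words):
    """Print statistics for a non-empty list of words"""
    print("Number of words:", len(words))
    print("Longest word:", longest_word(words))
    print("Shortest word:", shortest_word(words))
    print("Avg. word length:", average_word_length(words))


def sentence_stats(sentence):
    # The only sentence-specific step is producing the list of words
    print_stats(sentence.split())


def file_stats(filename):
    # The only file-specific step is producing the list of words
    print_stats(read_words(filename))


sentence_stats("this is a sentence")
```

Because both functions hand a plain list to print_stats, adding a third source of words would only require a new way to build that list.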
Optional Parameters
We have used range extensively when writing for-loops. Sometimes we include a start, and sometimes we include a step:
>>> help(range)
Help on class range in module builtins:
class range(object)
| range(stop) -> range object
| range(start, stop[, step]) -> range object
This works because Python supports optional arguments, e.g. the optional “step”. How would we implement our own version of range? Let’s fill in (optional_parameters.py) with our own implementation:
Tip: Possible implementation of my_range:
def my_range(start, stop, step):
    """
    Return a range

    Args:
        start: inclusive start index
        stop: exclusive stop index
        step: range increment

    Returns:
        A list of integers
    """
    i = start
    r = []
    while i < stop:
        r.append(i)
        i += step
    return r
What if we want a default step of 1 (similar to the built-in range)? Maybe we can write a new function?
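One way to write that new function is as a thin wrapper around my_range. This is a sketch only; the name my_range_by_one is made up for this illustration.

```python
def my_range(start, stop, step):
    """Return a list of integers from start (inclusive) to stop (exclusive)"""
    i = start
    r = []
    while i < stop:
        r.append(i)
        i += step
    return r


def my_range_by_one(start, stop):
    """Return a list of integers from start to stop, stepping by 1"""
    # Reuse my_range rather than copying its loop
    return my_range(start, stop, 1)


print(my_range_by_one(0, 5))  # [0, 1, 2, 3, 4]
```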
This is DRY-ish, but we can condense these two functions into one by setting a default value for step. We can do so by providing a value in the function header as shown below. We describe those parameters with default values as “optional”, i.e., we no longer have to provide a value for that parameter when we call the function. If we don’t provide a value, Python uses the default value specified in the header.
def my_range(start, stop, step=1):
Now we can use the same function for the two different use cases. More generally, optional parameters are useful when there is a sensible default value (i.e. stepping by one) but the caller might want/need to change that value sometimes.
Note that you can also specify parameters by name, which is helpful if there are many optional parameters and you only want to change one or two.
All Python function parameters can actually be specified by name (if we wanted to do so). There are some limits, however: keyword arguments must follow positional arguments and you can’t specify the same argument more than once. Our general practice is to provide required arguments (those without default values) by position (without the name) and optional arguments by name.
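The sketch below combines the default value for step with the different ways of supplying arguments. It repeats the my_range loop so that it can be run on its own.

```python
def my_range(start, stop, step=1):
    """Return a list of integers from start (inclusive) to stop (exclusive)"""
    i = start
    r = []
    while i < stop:
        r.append(i)
        i += step
    return r


print(my_range(0, 5))             # step omitted, default used: [0, 1, 2, 3, 4]
print(my_range(0, 10, 3))         # everything by position: [0, 3, 6, 9]
print(my_range(0, 10, step=5))    # optional argument by name: [0, 5]
print(my_range(stop=4, start=1))  # by name, in any order: [1, 2, 3]

# These calls break the rules described above, so they are left commented out:
# my_range(step=2, 0, 10)   # SyntaxError: positional argument follows keyword argument
# my_range(0, 10, start=3)  # TypeError: multiple values for argument 'start'
```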
>>> help(print)
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
A common place to use keyword arguments is with print, where you will likely only want to modify one of the many optional arguments, e.g. the separator (sep) or the end, but not all.
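As a brief illustration, the calls below each change a single optional argument of print and leave the others at their defaults. The values being printed are arbitrary.

```python
print("2025", "01", "15")           # 2025 01 15
print("2025", "01", "15", sep="-")  # 2025-01-15

# Change only the end so that the next print continues on the same line
print("Loading", end="...")
print("done")                       # Loading...done
```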
Modules
What is a module? A module is a collection of related functions and variables. Why do we have modules? To organize and distribute code in a way that minimizes naming conflicts. We have already been using modules (e.g. math and turtle) and you have actually all written modules! Every .py file is a module.
Let’s consider the linked my_module as an example. my_module includes a constant and several functions. After importing my_module we can use those functions like any of those in math or the other modules we have used.
import my_module
my_module.a()
my_module.b(10, 15)
my_module.c("this is a test")
my_module.SOME_CONSTANT
Loaded my_module
Loaded my_module
Importing the module my_module
10
25
'tt'
10
What about help?
help(my_module)
Help on module my_module:
NAME
my_module - Some basic functions to illustrate how modules work
DESCRIPTION
A more detailed description of the module.
FUNCTIONS
a()
Prints out the number 10
b(x, y)
Returns x plus y
c(some_string)
Returns the first and last character of some_string
DATA
SOME_CONSTANT = 10
FILE
/home/runner/work/csci146f25/csci146f25/site/classes/my_module.py
That multi-line comment at the top of the file is also a docstring:
The NAME and brief description come from the filename and that docstring
DESCRIPTION is any subsequent lines in that docstring
FUNCTIONS enumerates the functions and the description from their docstrings
DATA enumerates any constants
Importing
What happens when we import a module? Python executes the Python file.
Where did the __pycache__ folder come from? When we import a module, Python compiles to bytecode in a “.pyc” file. This lower-level representation is more efficient to execute. These files aren’t important for this class, but I want you to be aware of where those files are coming from…
So adding a print statement, e.g. print("Loaded my_module"), to our Python module file will print that message on import.
>>> import my_module
>>>
Why didn’t it print? Python doesn’t re-import modules that are already imported. Why does that behavior make sense? What if multiple modules import that same module, e.g. math? What if two modules import each other?
As a practical matter, that means if we change our module we will need to restart the Python console (click on the trash can icon next to the Terminal name and then open the shell again) or use the explicit reload function, importlib.reload.
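The following sketch shows the difference between importing again and reloading, using importlib.reload from the standard library. So that it can be run on its own, it writes a throwaway module to a temporary directory; the module name demo_module and its VALUE constant are made up for this illustration. With your own module you would only need the import and reload lines.

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module file (demo_module is a hypothetical name)
folder = tempfile.mkdtemp()
path = pathlib.Path(folder) / "demo_module.py"
path.write_text("VALUE = 1\n")
sys.path.insert(0, folder)  # let Python find modules in that folder

import demo_module
first = demo_module.VALUE  # 1

# Change the file, as we might in the editor
path.write_text("VALUE = 2  # edited\n")

import demo_module  # no effect: the module is already imported
after_import = demo_module.VALUE  # still 1

importlib.reload(demo_module)  # executes the file again
after_reload = demo_module.VALUE  # 2

print(first, after_import, after_reload)
```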
When we click the button, we are “running” our Python programs. We could also have been importing them. When would you want to do one or the other?
Think about our Cryptography assignment. We could imagine using our functions in a program that helps people securely communicate with each other, or imagine that other programmers might want to use our functions in their own communication systems. For the former, we would want to encrypt/decrypt when our module is run. For the latter, we would want to make our code “importable” without actually invoking any of the functions.
Python has a special variable __name__ that can be used to determine whether our module is being run or being imported. When the file is run, Python automatically sets that variable to be "__main__". If a file is imported, Python sets that variable to be the filename as a string (without the “.py” extension).
We typically use this variable in a conditional at the end of the file that changes the behavior depending on the context. For example:
if __name__ == "__main__":
    print("Running the module")
else:
    print("Importing the module")
In most cases, you will only have the “if” branch (you will only be doing something if the program is run).
For example, when prompting users for input in Programming Assignment 4, we would do so only if the program is being run (not imported). Gradescope imports your files so that it can test functions without necessarily simulating all of the user interactions.
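As a sketch of that pattern, consider a small file whose function can be imported and tested on its own. The file name greeter.py and the function greeting are made up for this illustration; in an assignment, the calls to input would go inside the if block.

```python
# Contents of a hypothetical file greeter.py


def greeting(name):
    """Return a greeting for name"""
    return "Hello, " + name + "!"


if __name__ == "__main__":
    # Runs for `python3 greeter.py`, but not for `import greeter`
    print(greeting("world"))
```

A grader (or another program) can then import the file and call greeting directly without triggering any of the code in the if block.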
Command line (and the Terminal)
In many programming assignments and examples we have either “hardcoded” the input, e.g., the “pokemon.txt” filename above, or we have solicited input from the user (via the input function) to control the execution of the program. As you test your programs, typing those inputs each time probably gets a little tedious. There must be a different way…
What does the button actually do?
The button actually invokes the Python program you installed on your computer at the beginning of the semester. Let’s unpack what it’s doing. First, either click Python Shell or the button and then type:
>>> exit()
You’re now in a Terminal or Command Prompt that you can use to (more generally) interact with your system. On Mac, you’ll see a $ which is the Terminal’s equivalent of Python’s >>> prompt. On Windows, you’ll see >. On Mac, you can get back to Python by typing (without the $)
$ python3
On Windows, you would type (without the >):
> python
Well that just brought us back to where we started. Instead, to run a program in some my_program.py, we can pass the filename to our Python program:
$ python3 my_program.py
or
> python my_program.py
Again, you should not type $ or >.
Why is this useful? For one, it gives us a way to run our programs from the command line. More importantly, it allows us to pass command line arguments to our programs. Any additional arguments after the script name become the command line arguments to the script.
What are command line arguments? Like function arguments/parameters, command line arguments are values passed to a Python program that will affect its execution. We use function parameters to change the inputs for our function. Command line arguments do the same for a program. Instead of using the input function to solicit input from the user to control the execution of our programs, we can specify those “inputs” on the command line as command line arguments. Doing so would facilitate controlling our programs in an automated way.
The why of the command line is a much larger question that we won’t fully experience in this class. From my own personal experience, being able to efficiently use a command line environment (and write programs to be used in that environment) will make you much more productive and effective at data analysis and other computational tasks.
For example, are you curious how many lines of code you’ve written so far this semester? Here is a function to count the non-empty lines in a file. But how can we run this on every file in your cs146 folder?
def count_lines(filename):
    """
    Count non-empty lines in file

    Args:
        filename: File to examine

    Return:
        Count of non-empty lines
    """
    with open(filename, "r") as file:
        count = 0
        for line in file:
            if line.strip() != "":
                count += 1
        return count
We could manually make a list of all the files, but that is slow and error prone. Instead we would like to solve this problem programmatically. The command line can help us do so. It provides a mechanism for programmatically interacting with your computer, e.g. programmatically accessing directories, files, other programs and more. Let’s learn how to make that work.
$ python3 line_counter.py *.py
Total lines: 1074
A Command Line Example
We will use sys_args.py as our working example. First run this program with the button.
Arguments: ['sys_args.py']
0: sys_args.py
This program is meant to print arguments that are passed to the program.
Within the Python module sys (short for “system”) there is a variable argv that is set to be a list of the command line arguments. The first element of this list is always the path of the program that is executing.
$ python3 sys_args.py these are some arguments in 2025
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments', 'in', '2025']
0: sys_args.py
1: these
2: are
3: some
4: arguments
5: in
6: 2025
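The contents of sys_args.py are not shown in these notes. A minimal sketch that would produce output in the same format might look like the following; the helper describe_arguments is an assumption made for this illustration.

```python
import sys


def describe_arguments(arguments):
    """Return one "index: value" string per command line argument"""
    lines = []
    for i in range(len(arguments)):
        lines.append(str(i) + ": " + arguments[i])
    return lines


if __name__ == "__main__":
    # sys.argv[0] is the program itself; the rest are the arguments we typed
    print("Arguments:", sys.argv)
    for line in describe_arguments(sys.argv):
        print(line)
```

Every element of sys.argv is a string, so the 2025 above arrives as '2025'. Convert it with int if a number is needed.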
Navigating with the Terminal
There is also a concept of a “working directory”, which is where in the file system we are executing our program. When we invoke Python in the terminal, we will need to navigate within the terminal to the directory containing our program.
The key commands we will use to navigate in the terminal are:
| Command | Description |
| --- | --- |
| ls | List files |
| cd dir | Change directory to dir |
| cd .. | Change to parent directory (i.e., go up one level of hierarchy) |
| cd | Change to home directory |
| pwd | Print the path of the current working directory |
| more <file> | Show contents of file one screenful at a time (hit q to exit) |
The Windows equivalent of the terminal is cmd. The mapping between commands for navigating within the terminal/shell is:
| Linux/OSX | Windows |
| --- | --- |
| ls | dir |
| cd | cd |
| cd /home/philip/ | cd C:\Users\philip |
With these commands we are navigating the same file system and directories you see with your graphical browser, but doing so in a text-based programmatic environment.
For example you will likely need to navigate to the directory that contains your Python script. A protocol to do so:
Find the directory containing your Python program. For the file /Users/philip/Documents/cs146/sys_args.py, the directory is everything up to the last /, i.e. /Users/philip/Documents/cs146. These slashes delimit each directory; on Windows they are backslashes (\).
In the terminal at the command prompt, e.g. at the $, type cd for “change directory” then enter the path. For example:
$ cd /Users/philip/cs146/
cd only works on directories. If you have any spaces in your path, you will need to add quotes around the path so it is interpreted as a single string (you can use left and right arrows to move in your command to edit it). For example:
$ cd "/Users/philip/cs146/"
Returning to our line counter
In our earlier example usage we used *.py as a wildcard (also called a glob) that the terminal expands into all files that end in “.py”, i.e., that was equivalent to listing each of those files explicitly on the command line.
We can write our Python code to process any number of files provided on the command line. Here we use a for loop to iterate through all the files provided on the command line and thus in the sys.argv list (recall that the first element, at index 0, is always the name of the program that is executing). With that small amount of code, we now have a very useful (and efficient) tool. Add the following at the end of your line_counter.py file:
# Note: this assumes `import sys` appears at the top of line_counter.py
if __name__ == "__main__":
    # Check that at least one file is provided on the command line
    if len(sys.argv) == 1:
        print("Usage: python line_counter.py <1 or more files>")
    else:
        count = 0
        # Process all of the command line arguments (after the name of the program that is always at index 0)
        for filename in sys.argv[1:]:
            count += count_lines(filename)
        print("Total lines:", count)
Could we have accomplished the same task purely within Python, without using the command line environment? Yes, although the resulting approach would be less flexible. For example, we could use the listdir function on the os module to return a list of all the files in the current directory and then filter that list for just those files with names ending in “.py”:
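A minimal sketch of that approach follows. It repeats count_lines so that it can be run on its own, and the function name count_python_lines is made up for this illustration.

```python
import os


def count_lines(filename):
    """Count non-empty lines in file"""
    with open(filename, "r") as file:
        count = 0
        for line in file:
            if line.strip() != "":
                count += 1
        return count


def count_python_lines():
    """Count non-empty lines in all ".py" files in the current directory"""
    count = 0
    # "." is the current working directory
    for filename in os.listdir("."):
        # Keep only the names ending in ".py"
        if filename.endswith(".py"):
            count += count_lines(filename)
    return count
```

Calling count_python_lines() from the directory containing your programs returns the total for that directory only.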
While this code may seem simpler than the approach above, it has several assumptions built-in which may make it less flexible. For example, we are only interested in files in the current directory and only files ending in “.py”. If we want to look at files in a different directory or with different/multiple file endings we will need to modify our program. In contrast, our approach using the command line works for all those scenarios without any modification. For example,
$ python line_counter.py *.py *.txt
counts the lines in both the Python files and the text files by expanding both wildcards. In this respect, the command line environment “augments” the capabilities of our Python programs, which makes them more flexible.