Lesson 2 - Conditional, Functions, User-input and Errors

Author

Valentine Gilbart

Published

May 13, 2026

Introduction

Aim of the class

At the end of this class, you will be able to:

  • Use conditionals to make choices in your code
  • Understand some major data structures in Python
  • Create simple functions
  • Run command-line scripts with user input
  • Understand and raise exceptions

Making choices

In our last lesson, we discovered something suspicious was going on in our inflammation data by drawing some plots. How can we use Python to automatically recognize the different features we saw, and take a different action for each? In this lesson, we’ll learn how to write code that runs only when certain conditions are true.

Conditionals

Conditionals allows you to make decisions in your code based on certain conditions.

if something is true:
    do task a
otherwise:
    do task b

For example we can ask Python to take different actions, depending on a condition, with an if statement:

num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')
not greater
done

If the test that follows the if statement is true, the body of the if (i.e., the set of lines indented underneath it) is executed, and “greater” is printed. If the test is false, the body of the else is executed instead, and “not greater” is printed. Only one or the other is ever executed before continuing on with program execution to print “done”.

Figure 4.1: Conditional schema

Conditional statements don’t have to include an else. If there isn’t one, Python simply does nothing if the test is false:

num = 53
print('before conditional...')
if num > 100:
    print(num, 'is greater than 100')
print('...after conditional')
before conditional...
...after conditional

Comparison and membership operators

Here we used a comparison operator (>) to compare the value of num to 100. Here are other comparison operators:

Operator Name
== Equal
!= Not equal
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
# Example
2 == 1 + 1
True
Note

Note that to test for equality we use a double equals sign == rather than a single equals sign = which is used to assign values.

Warning

You should never use equalty operators (==or !=) with floats or complex values.

# Example
2.1 + 3.2 == 5.3
False

This is a floating point arithmetic problem seen in other programming languages. It is due to the difficulty of having a fixed number of binary digits (bits) to accurately represent some decimal number. This leads to small rounding errors in calculations.

2.1 + 3.2 
5.300000000000001

If you need to use equalty operators, do it with a degree of freedom:

tol = 1e-6 ; abs((2.1 + 3.2) - 5.3) < tol
True

The result of a comparison operator is a boolean value (True or False), which can be used in an if statement to control the flow of the program.

Booleans represent one of two values: True or False. When you compare two values, the expression is evaluated and Python returns the Boolean answer:

num = 37
print(num > 100)
False

There are also membership operators (in and not in). They are used to test if a value is present in a sequence (like a list or a string):

Operator Description
in Returns True if a sequence with the specified value is present in the object
not in Returns True if a sequence with the specified value is not present in the object
seq = ['ATGAAGGGTCCAAAA', 'AGTCCCCGTATGAT', 'ACCT', 'ACCT']

print('ACCT' in seq)
print('G' in seq[-1])
True
False

elif statements

We can also chain several tests together using elif which is short for “else if”. The elif keyword is Python’s way of saying “if the previous conditions were not true, then try this condition”. The following Python code uses elif to print the sign of a number.

num = -3

if num > 0:
    print(num, 'is positive')
elif num == 0:
    print(num, 'is zero')
else:
    print(num, 'is negative')
-3 is negative

We can also combine tests using logical operators and, or and not to create more complex conditions.

Logical operators

Logical operators are used to combine conditional statements:

Operator Description
and Returns True if both statements are true
or Returns True if one of the statements is true
not Reverse the result, returns False if the result is true

Result of the and operator:

Boolean 1 Boolean 2 Result
True True True
True False False
False True False
False False False
# Example 
False and False, False and True, True and False, True and True
(False, False, False, True)

Result of the or operator:

Boolean 1 Boolean 2 Result
True True True
True False True
False True True
False False False
# Example 
False or False, False or True, True or False, True or True
(False, True, True, True)
# Example 
True or not True
True
if (1 > 0) and (-1 >= 0):
    print('both parts are true')
else:
    print('at least one part is false')
at least one part is false
if (1 < 0) or (1 >= 0):
    print('at least one test is true')
at least one test is true

Exercise

ImportantExercise

With gene1_expression and gene2_expression given, are these 2 codes equivalent?

# Code A
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
elif gene1_expression < gene2_expression:
  print("Gene 2 has higher expression level.")
else:
  print("Gene 1 and Gene 2 have the same expression level.")
# Code B
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
else:
  if gene1_expression < gene2_expression:
    print("Gene 2 has higher expression level.")
  else:
    print("Gene 1 and Gene 2 have the same expression level.")

Yes! Both codes will produce the same output for any given values of gene1_expression and gene2_expression. The first code uses elif to handle the second condition, while the second code uses a nested if statement. However, both structures will evaluate the conditions in the same way and yield the same results.

Understand better with an example:

gene1_expression = 100
gene2_expression = 50

# Code A
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
elif gene1_expression < gene2_expression:
  print("Gene 2 has higher expression level.")
else:
  print("Gene 1 and Gene 2 have the same expression level.")

# Code B
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
else:
  if gene1_expression < gene2_expression:
    print("Gene 2 has higher expression level.")
  else:
    print("Gene 1 and Gene 2 have the same expression level.")
Gene 1 has higher expression level.
Gene 1 has higher expression level.
gene1_expression = 100
gene2_expression = 100

# Code A
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
elif gene1_expression < gene2_expression:
  print("Gene 2 has higher expression level.")
else:
  print("Gene 1 and Gene 2 have the same expression level.")

# Code B
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
else:
  if gene1_expression < gene2_expression:
    print("Gene 2 has higher expression level.")
  else:
    print("Gene 1 and Gene 2 have the same expression level.")
Gene 1 and Gene 2 have the same expression level.
Gene 1 and Gene 2 have the same expression level.
gene1_expression = 50
gene2_expression = 100

# Code A
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
elif gene1_expression < gene2_expression:
  print("Gene 2 has higher expression level.")
else:
  print("Gene 1 and Gene 2 have the same expression level.")

# Code B
if gene1_expression > gene2_expression:
  print("Gene 1 has higher expression level.")
else:
  if gene1_expression < gene2_expression:
    print("Gene 2 has higher expression level.")
  else:
    print("Gene 1 and Gene 2 have the same expression level.")
Gene 2 has higher expression level.
Gene 2 has higher expression level.
ImportantExercise

Are these two codes equivalent?

# Code A
if "ATG" in dna_sequence:
  print("Start codon found.")  
elif "TAG" in dna_sequence:
  print("Stop codon found.")  
else:
  print("No interesting codon found.") 
# Code B
if "ATG" in dna_sequence:
  print("Start codon found.")  
  if "TAG" in dna_sequence:
    print("Stop codon found.")  
else:
  print("No interesting codon found.") 

Understand better with an example:

dna_sequence = "ATGCTAGCTAGCTAG"

# Code A
if "ATG" in dna_sequence:
  print("Start codon found.")  
elif "TAG" in dna_sequence:
  print("Stop codon found.")  
else:
  print("No interesting codon found.") 
Start codon found.
# Code B
if "ATG" in dna_sequence:
  print("Start codon found.")  
  if "TAG" in dna_sequence:
    print("Stop codon found.")  
else:
  print("No interesting codon found.") 
Start codon found.
Stop codon found.

The equivalent code to Code A would be:

if "ATG" in dna_sequence:
  print("Start codon found.")  
else:
  if "TAG" in dna_sequence:
    print("Stop codon found.")  
  else:
    print("No interesting codon found.") 
Start codon found.
ImportantExercise

This code is incorrect, can you tell why?

x = 7

if x > 5:
  print("x is greater than 5")  
  if x > 10:
    print("x is greater than 10")  
  elif x = 10: 
    print("x equals 10") 
  else:
    print("x is less than 10")  
x = 7

if x > 5:
  print("x is greater than 5")  
  if x > 10:
    print("x is greater than 10")  
  elif x == 10: 
    print("x equals 10") 
    # It must be ==, the comparison operator 
    # instead of =, the assignement operator
  else:
    print("x is less than 10")  
x is greater than 5
x is less than 10

Checking our data

Now that we’ve seen how conditionals work, we can use them to check for the suspicious features we saw in our inflammation data.

From the first couple of plots, we saw that maximum daily inflammation exhibits a strange behavior and raises one unit a day. Wouldn’t it be a good idea to detect such behavior and report it as suspicious? Let’s do that! However, instead of checking every single day of the study, let’s merely check if maximum inflammation in the beginning (day 0) and in the middle (day 20) of the study are equal to the corresponding day numbers.

import pandas as pd

data = pd.read_csv('data/inflammation-01.csv', index_col=0)

max_inflammation_0 = data.max(axis=0).iloc[0]
max_inflammation_20 = data.max(axis=0).iloc[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
Suspicious looking maxima!

We also saw a different problem in the third dataset; the minima per day were all zero (looks like a healthy person snuck into our study). And if neither of these conditions are true, we can use else to give the all-clear:

data = pd.read_csv('data/inflammation-01.csv', index_col=0)

max_inflammation_0 = data.max(axis=0).iloc[0]
max_inflammation_20 = data.max(axis=0).iloc[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
Suspicious looking maxima!

With another dataset:

data = pd.read_csv('data/inflammation-03.csv', index_col=0)

max_inflammation_0 = data.max(axis=0).iloc[0]
max_inflammation_20 = data.max(axis=0).iloc[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
Minima add up to zero!

In this way, we have asked Python to do something different depending on the condition of our data. Here we printed messages in all cases, but we could also imagine not using the else catch-all so that messages are only printed when something is wrong, freeing us from having to manually examine every plot for features we’ve seen before.

ImportantExercise

Instead of just printing if a file falls into one or the other category, we want to save in lists the name of the files that have suspicious maximum trend, a suspicious minimum trend and the ones that seem normal at the moment.

Modify the following code:

data = pd.read_csv('data/inflammation-03.csv', index_col=0)

max_inflammation_0 = data.max(axis=0).iloc[0]
max_inflammation_20 = data.max(axis=0).iloc[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif data.min(axis=0).sum() == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
Minima add up to zero!

You will need to loop through all the files in the data folder, and create three lists maxima_suspicious_files, minima_suspicious_files and normal_files to save the names of the files that fall into each category.

Remember to make use of the function list.append(element) to add an element to a list.

import glob
import pandas as pd
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('data/inflammation*.csv'))

maxima_suspicious_files = []
minima_suspicious_files = []
normal_files = []

for filename in filenames:
    data = pd.read_csv(filename, index_col=0)
    max_inflammation_0 = data.max(axis=0).iloc[0]
    max_inflammation_20 = data.max(axis=0).iloc[20]
    if max_inflammation_0 == 0 and max_inflammation_20 == 20:
        maxima_suspicious_files.append(filename)
    elif data.min(axis=0).sum() == 0:
        minima_suspicious_files.append(filename)
    else:
        normal_files.append(filename)

print("maxima_suspicious_files", maxima_suspicious_files)
print("minima_suspicious_files", minima_suspicious_files)
print("normal_files", normal_files)
maxima_suspicious_files ['data/inflammation-01.csv', 'data/inflammation-02.csv', 'data/inflammation-04.csv', 'data/inflammation-05.csv', 'data/inflammation-06.csv', 'data/inflammation-07.csv', 'data/inflammation-09.csv', 'data/inflammation-10.csv', 'data/inflammation-12.csv']
minima_suspicious_files ['data/inflammation-03.csv', 'data/inflammation-08.csv', 'data/inflammation-11.csv']
normal_files []

Dictionaries and Sets

Dictionnary

In the last exercise, we created three lists to store the names of the files that have suspicious maximum trend, a suspicious minimum trend and the ones that seem normal at the moment. We could also have stored this information in a dictionary, which is a data structure that allows us to store key: value pairs.

A dictionary is a collection which is ordered (as of Python >= 3.7), changeable and does not allow duplicates keys. Dictionaries are written with curly brackets {}, with keys and values. Keys must be unique and of an immutable type (like strings or numbers), while values can be of any type (like strings, numbers, lists, dictionnaries…) and can be duplicated.

organism1_genes = {
  #key: value;
  'BRCA1': 'DNA repair', 
  'TP53': 'Tumor suppressor', 
  'EGFR': 'Cell growth', 
  'MYC': 'Regulation of gene expression'
}

Dictionary items can be referred to by using the key name.

organism1_genes["BRCA1"]
'DNA repair'

Dictionaries have specific methods. Here are a few:

Method Description
.items() Returns a list containing a tuple for each key value pair
.keys() Returns a list containing the dictionary’s keys
.values() Returns a list of all the values in the dictionary
.pop() Removes the element with the specified key
.get() Returns the value of the specified key
ImportantExercise

Modify the previous code so that instead of having 3 lists maxima_suspicious_files, minima_suspicious_files and normal_files you have only one dictionnary, called classify_files. The values of the dictionary should be a list containing the name of the files, and the keys should be one of the following strings: "suspicious maxima", "suspicious minima" or "normal". How many files falls into each category?

You can initialize the dictionnary like so:

categorize_files = {
    "suspicious maxima": [],
    "suspicious minima": [],
    "normal": []
}
import glob
import pandas as pd
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('data/inflammation*.csv'))

categorize_files = {
    "suspicious maxima": [],
    "suspicious minima": [],
    "normal": []
}

for filename in filenames:
    data = pd.read_csv(filename, index_col=0)
    max_inflammation_0 = data.max(axis=0).iloc[0]
    max_inflammation_20 = data.max(axis=0).iloc[20]
    if max_inflammation_0 == 0 and max_inflammation_20 == 20:
        categorize_files["suspicious maxima"].append(filename)
    elif data.min(axis=0).sum() == 0:
        categorize_files["suspicious minima"].append(filename)
    else:
        categorize_files["normal"].append(filename)

print("Number of maxima_suspicious_files:", len(categorize_files["suspicious maxima"]))
print("Number of minima_suspicious_files:", len(categorize_files["suspicious minima"]))
print("Number of normal_files:", len(categorize_files["normal"]))
Number of maxima_suspicious_files: 9
Number of minima_suspicious_files: 3
Number of normal_files: 0

You can loop through the keys of a dictionary using the method .keys(), through the values using the method .values(), and through both keys and values using the method .items().

Here is an example:

organism1_genes = {
  #key: value;
  'BRCA1': 'DNA repair', 
  'TP53': 'Tumor suppressor', 
  'EGFR': 'Cell growth', 
  'MYC': 'Regulation of gene expression'
}

for gene, function in organism1_genes.items():
    print(gene, "is involved in", function)
BRCA1 is involved in DNA repair
TP53 is involved in Tumor suppressor
EGFR is involved in Cell growth
MYC is involved in Regulation of gene expression
ImportantExercise

Loop through the categorize_files dictionary created in the previous exercise, and print for each file, the category in which it falls into.

The output printed could look like:

data/inflammation-01.csv falls under the suspicious maxima category
data/inflammation-02.csv falls under the suspicious maxima category
data/inflammation-04.csv falls under the suspicious maxima category
data/inflammation-05.csv falls under the suspicious maxima category
data/inflammation-06.csv falls under the suspicious maxima category
data/inflammation-07.csv falls under the suspicious maxima category
data/inflammation-09.csv falls under the suspicious maxima category
data/inflammation-10.csv falls under the suspicious maxima category
data/inflammation-12.csv falls under the suspicious maxima category
data/inflammation-03.csv falls under the suspicious minima category
data/inflammation-08.csv falls under the suspicious minima category
data/inflammation-11.csv falls under the suspicious minima category

Loop through the filenames list created by filenames = sorted(glob.glob('data/inflammation*.csv')) instead of the categorize_files dictionary. Print for each file, the category in which it falls into. You will need to use an if statement and the membership operator in to check if a file is in one of the two categories. The output printed could look like:

data/inflammation-01.csv falls under the suspicious maxima category
data/inflammation-02.csv falls under the suspicious maxima category
data/inflammation-03.csv falls under the suspicious minima category
data/inflammation-04.csv falls under the suspicious maxima category
data/inflammation-05.csv falls under the suspicious maxima category
data/inflammation-06.csv falls under the suspicious maxima category
data/inflammation-07.csv falls under the suspicious maxima category
data/inflammation-08.csv falls under the suspicious minima category
data/inflammation-09.csv falls under the suspicious maxima category
data/inflammation-10.csv falls under the suspicious maxima category
data/inflammation-11.csv falls under the suspicious minima category
data/inflammation-12.csv falls under the suspicious maxima category
for key, value in categorize_files.items():
  for file in value:
    print(file, "falls under the", key, "category")
data/inflammation-01.csv falls under the suspicious maxima category
data/inflammation-02.csv falls under the suspicious maxima category
data/inflammation-04.csv falls under the suspicious maxima category
data/inflammation-05.csv falls under the suspicious maxima category
data/inflammation-06.csv falls under the suspicious maxima category
data/inflammation-07.csv falls under the suspicious maxima category
data/inflammation-09.csv falls under the suspicious maxima category
data/inflammation-10.csv falls under the suspicious maxima category
data/inflammation-12.csv falls under the suspicious maxima category
data/inflammation-03.csv falls under the suspicious minima category
data/inflammation-08.csv falls under the suspicious minima category
data/inflammation-11.csv falls under the suspicious minima category
for filename in filenames:
  if filename in categorize_files["suspicious maxima"]:
    print(filename, "falls under the suspicious maxima category")
  elif filename in categorize_files["suspicious minima"]:
    print(filename, "falls under the suspicious minima category")
  else:
    print(filename, "does not fall under any category")
data/inflammation-01.csv falls under the suspicious maxima category
data/inflammation-02.csv falls under the suspicious maxima category
data/inflammation-03.csv falls under the suspicious minima category
data/inflammation-04.csv falls under the suspicious maxima category
data/inflammation-05.csv falls under the suspicious maxima category
data/inflammation-06.csv falls under the suspicious maxima category
data/inflammation-07.csv falls under the suspicious maxima category
data/inflammation-08.csv falls under the suspicious minima category
data/inflammation-09.csv falls under the suspicious maxima category
data/inflammation-10.csv falls under the suspicious maxima category
data/inflammation-11.csv falls under the suspicious minima category
data/inflammation-12.csv falls under the suspicious maxima category
ImportantExercise

From the dictionary organism1_genes created as example, get the value of the key BRCA1. If the key does not exist, return Unknown by default. Try your code before and after removing the BRCA1 key:value pair.

Check the help of get by running help(dict.get).

organism1_genes
{'BRCA1': 'DNA repair',
 'TP53': 'Tumor suppressor',
 'EGFR': 'Cell growth',
 'MYC': 'Regulation of gene expression'}
organism1_genes.get('BRCA1', 'Unknown')
'DNA repair'
organism1_genes.pop('BRCA1')
'DNA repair'
organism1_genes
{'TP53': 'Tumor suppressor',
 'EGFR': 'Cell growth',
 'MYC': 'Regulation of gene expression'}
organism1_genes.get('BRCA1', 'Unknown')
'Unknown'

Set

Let’s learn about another data structure: the set. It will be useful to check for common elements between two lists, for example, the list of files categorized as “suspicious maxima” and the list of files categorized as “suspicious minima”.

Set is a collection which is unordered and unindexed. It does not allow duplicate members (they will be ignored). Sets are written with curly brackets {}.

seq = {'BRCA1', 'TP53', 'EGFR', 'MYC'}

Once a set is created, you cannot change its items directly (as they don’t have index), but you modify the set by removing and adding items.

Sets have specific methods. Here are a few:

Method Description
.add() Adds an element to the set
 .difference()   Returns a set containing the difference between two sets
 .intersection()   Returns a set containing the intersection between two sets
 .union()   Returns a set containing the union of two sets
 .remove()  Remove the specified item
 .pop()  Removes a random element
ImportantExercise

Using the two following sets, create sets that represent:

  • the genes that are presents in both sets
  • the genes that are present in at least one of the two sets
  • the genes that are present in the first set but not in the second one
  • the genes that are present in the second set but not in the first one

Remove from both sets the genes that are presents in boths sets. You can loop through a list of common genes, and use the method .remove() on both sets to do that.

organism1_genes = {'BRCA1', 'TP53', 'EGFR', 'MYC'}
organism2_genes = {'TP53', 'MYC', 'KRAS', 'BRAF'}
common_genes = organism1_genes.intersection(organism2_genes)
union_genes = organism1_genes.union(organism2_genes)
gene_in1not2 = organism1_genes.difference(organism2_genes)
gene_in2not1 = organism2_genes.difference(organism1_genes)
print("Present in both sets :", common_genes)
print("Present in at least one of the two sets:", union_genes)
print("Present in the first set but not in the second one:", gene_in1not2)
print("Present in the second set but not in the first one:", gene_in2not1)

for gene in common_genes:
    organism1_genes.remove(gene)
    organism2_genes.remove(gene)
Present in both sets : {'MYC', 'TP53'}
Present in at least one of the two sets: {'BRAF', 'KRAS', 'MYC', 'EGFR', 'BRCA1', 'TP53'}
Present in the first set but not in the second one: {'BRCA1', 'EGFR'}
Present in the second set but not in the first one: {'BRAF', 'KRAS'}

Conversion between types

You can get the data type of any object by using the function type(). You can (more or less easily) convert between data types.

Function Description
bool() Convert to boolean type
int(), float() Convert between integer or float types
str() Convert to string type
list(), set() Convert between list, and set types
bool(1)
True
int(5.8) 
5
str(1)
'1'
list({1, 2, 3})
[1, 2, 3]
set([1, 2, 3, 3])
{1, 2, 3}

Exercise

ImportantExercise

Verify that no file has a suspicious maximum trend and a suspicious minimum trend at the same time. You will need to modify the following code, to allow a file to be categorized as “suspicious minima” even if it is already categorized as “suspicious maxima”, and vice versa. Don’t bother with the “normal” category, you can remove it as there was no files falling into this category.

You can keep using dictionnaries, and then transform the values into sets to check for common files between the two categories.

import glob
import pandas as pd
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('data/inflammation*.csv'))

categorize_files = {
    "suspicious maxima": [],
    "suspicious minima": [],
    "normal": []
}

for filename in filenames:
    data = pd.read_csv(filename, index_col=0)
    max_inflammation_0 = data.max(axis=0).iloc[0]
    max_inflammation_20 = data.max(axis=0).iloc[20]
    if max_inflammation_0 == 0 and max_inflammation_20 == 20:
        categorize_files["suspicious maxima"].append(filename)
    elif data.min(axis=0).sum() == 0:
        categorize_files["suspicious minima"].append(filename)
    else:
        categorize_files["normal"].append(filename)

print("maxima_suspicious_files", categorize_files["suspicious maxima"])
print("minima_suspicious_files", categorize_files["suspicious minima"])
print("normal_files", categorize_files["normal"])
maxima_suspicious_files ['data/inflammation-01.csv', 'data/inflammation-02.csv', 'data/inflammation-04.csv', 'data/inflammation-05.csv', 'data/inflammation-06.csv', 'data/inflammation-07.csv', 'data/inflammation-09.csv', 'data/inflammation-10.csv', 'data/inflammation-12.csv']
minima_suspicious_files ['data/inflammation-03.csv', 'data/inflammation-08.csv', 'data/inflammation-11.csv']
normal_files []
import glob
import pandas as pd
import matplotlib.pyplot as plt

filenames = sorted(glob.glob('data/inflammation*.csv'))

categorize_files = {
    "suspicious maxima": [],
    "suspicious minima": [],
}

for filename in filenames:
    data = pd.read_csv(filename, index_col=0)
    max_inflammation_0 = data.max(axis=0).iloc[0]
    max_inflammation_20 = data.max(axis=0).iloc[20]
    if max_inflammation_0 == 0 and max_inflammation_20 == 20:
        categorize_files["suspicious maxima"].append(filename)
    if data.min(axis=0).sum() == 0:
        categorize_files["suspicious minima"].append(filename)

set_maxima = set(categorize_files["suspicious maxima"])
set_minima = set(categorize_files["suspicious minima"])

print("Intersection between maxima and minima suspicious files: ", set_maxima.intersection(set_minima))
Intersection between maxima and minima suspicious files:  set()

The set is empty, there are no files with both suspicious maximum and minimum trends.

Creating functions

We have created a workflow to check if our data looks suspicious. In python we can wrap this workflow into a function, which is a reusable block of code that performs a specific task. Functions allow us to organize our code and make it more modular and easier to read.

For example, instead of reading a whole block of code, we could create a function called detect_problems() that takes a filename as input and returns whether the data in the file is suspicious or not. But first let’s learn how to create a function in Python.

Note

A function usually takes some data as input (parameters that are required or optional), and usually returns an output (that can be of any type).

We already learned how to run a predefined function in the last lesson. You need to write its name followed by parenthesis. Parameters are added inside the parenthesis as follow:

# round(number, ndigits=None)
x = round(number = 5.76543, ndigits = 2)
print(x)
5.77

To get more information about a function, use the help() function.

We will now learn how to create our own function.

Imagine that we want to convert a temperature in Fahrenheit and converting it to Celsius. We could write:

fahrenheit_val = 99
celsius_val = ((fahrenheit_val - 32) * (5/9))

and for a second number we could just copy the line and rename the variables:

fahrenheit_val = 99
celsius_val = ((fahrenheit_val - 32) * (5/9))

fahrenheit_val2 = 43
celsius_val2 = ((fahrenheit_val2 - 32) * (5/9))

But we would be in trouble as soon as we had to do this more than a couple times. Cutting and pasting it is going to make our code get very long and very repetitive, very quickly. We’d like a way to package our code so that it is easier to reuse, a shorthand way of re-executing longer pieces of code. This is what functions are for.

Syntax

In python, a function is declared with the keyword def followed by its name, and the arguments inside parenthesis. The next block of code, corresponding to the content of the function, must be indented. The output is defined by the return keyword.

def explicit_fahr_to_celsius(temp):
    # Assign the converted value to a variable
    converted = ((temp - 32) * (5/9))
    # Return the value of the new variable
    return converted
    
def fahr_to_celsius(temp):
    # Return converted value more efficiently using the return
    # function without creating a new variable. This code does
    # the same thing as the previous function but it is more explicit
    # in explaining how the return command works.
    return ((temp - 32) * (5/9))
Figure 6.1: Function schema

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back. Let’s try running our function:

converted_temp = fahr_to_celsius(32)
print(converted_temp)

# or even,
print('freezing point of water:', fahr_to_celsius(32), 'C')
print('boiling point of water:', fahr_to_celsius(212), 'C')
0.0
freezing point of water: 0.0 C
boiling point of water: 100.0 C

We’ve successfully called the function that we defined, and we have access to the value that we returned.

Documentation

You can also add a description of the function directly after the function definition. It is the message that will be shown when running help(). As it can be along text over multiple lines, it is common to put it inside triple quotes """.

def fahr_to_celsius(temp):
    """Returns the temperature in Celsius given a temperature in Fahrenheit.
    Parameters:
    temp (float): Temperature in Fahrenheit.
    """
    return ((temp - 32) * (5/9))

help(fahr_to_celsius)
Help on function fahr_to_celsius in module __main__:

fahr_to_celsius(temp)
    Returns the temperature in Celsius given a temperature in Fahrenheit.
    Parameters:
    temp (float): Temperature in Fahrenheit.

Arguments

You can have several arguments. They can be mandatory or optional. To make them optional, they need to have a default value assigned inside the function definition, like so:

def fahr_to_celsius(temp, inverted = False):
    """Returns the temperature in Celsius given a temperature in Fahrenheit, 
    or the inverse (Celsius to Fahrenheit) if inversted is True.
    Parameters:
    temp (float): Temperature in Fahrenheit.
    inverted (bool, optional): Whether to convert from Celsius to Fahrenheit (True) or Fahrenheit to Celsius (False).
    """
    if inverted:
        new_temp = ((temp * (9/5)) + 32)
    else:
        new_temp = ((temp - 32) * (5/9))
    return new_temp

# Or more efficiently:
def fahr_to_celsius(temp, inverted = False):
    """Returns the temperature in Celsius given a temperature in Fahrenheit, 
    or the inverse (Celsius to Fahrenheit) if inversted is True.
    Parameters:
    temp (float): Temperature in Fahrenheit.
    inverted (bool, optional): Whether to convert from Celsius to Fahrenheit (True) or Fahrenheit to Celsius (False).
    """
    if inverted:
        return ((temp * (9/5)) + 32)
    else:
        return ((temp - 32) * (5/9))

help(fahr_to_celsius)
Help on function fahr_to_celsius in module __main__:

fahr_to_celsius(temp, inverted=False)
    Returns the temperature in Celsius given a temperature in Fahrenheit, 
    or the inverse (Celsius to Fahrenheit) if inversted is True.
    Parameters:
    temp (float): Temperature in Fahrenheit.
    inverted (bool, optional): Whether to convert from Celsius to Fahrenheit (True) or Fahrenheit to Celsius (False).
Note

The return keyword is present twice, but the function will stop as soon as it reaches the first return statement, so if inverted is True, the second return statement will never be reached.

The input temp is mandatory but inverted is optional as it has a default value of False. If we call the function without providing a value for inverted, it will be set to False by default.

fahr_to_celsius(32)
0.0
fahr_to_celsius(inverted = True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[59], line 1
----> 1 fahr_to_celsius(inverted = True)

TypeError: fahr_to_celsius() missing 1 required positional argument: 'temp'
Note

Reminder: if you provide the parameters in the exact same order as they are defined, you don’t have to name them. If you name the parameters you can switch their order. As good practice, put all required parameters first.

fahr_to_celsius(inverted = True, temp = 100)
212.0
fahr_to_celsius(100, True)
212.0

Output

If no return statement is given, then no output will be returned, but the function will be run.

def fahr_to_celsius(temp, inverted = False):
    """Returns the temperature in Celsius given a temperature in Fahrenheit, 
    or the inverse (Celsius to Fahrenheit) if inversted is True.
    Parameters:
    temp (float): Temperature in Fahrenheit.
    inverted (bool, optional): Whether to convert from Celsius to Fahrenheit (True) or Fahrenheit to Celsius (False).
    """
    print("We are inside the function, the value of temp is:", temp)
    if inverted:
        new_temp = ((temp * (9/5)) + 32)
    else:
        new_temp = ((temp - 32) * (5/9))
print(fahr_to_celsius(32))
We are inside the function, the value of temp is: 32
None

The output can be of any type. If you have a lot of things to return, you might want to return a list.

def multiple_of_3(list_of_numbers):
  """Returns the number that are multiple of 3."""
  multiples = []
  for num in list_of_numbers:
    if num % 3 == 0: 
      # num % 3 == 0 means that the remainder 
      # of num divided by 3 is 0, 
      # which means that num is a multiple of 3
      multiples.append(num)
  return multiples

multiple_of_3(range(1, 20, 2))
[3, 9, 15]
Note

This could be written as a one-liner.

def multiple_of_3(list_of_numbers):
  """Returns the number that are multiple of 3."""
  multiples = [num for num in list_of_numbers if num % 3 == 0]
  return multiples

multiple_of_3(range(1, 20, 2))
[3, 9, 15]

Variable Scope

In composing our temperature conversion function, we sometimes created a variable inside of the function, new_temp. We refer to such variables as local variables because they no longer exist once the function is done executing. If we try to access their values outside of the function, we will encounter an error:

print(new_temp)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(new_temp)

NameError: name 'new_temp' is not defined

If you want to reuse the converted temperature having calculated it with fahr_to_celsius, you can store the result of the function call in a variable:

converted_temp = fahr_to_celsius(32)
print("Converted temp is:", converted_temp)
We are inside the function, the value of temp is: 32
Converted temp is: None

The variable converted_temp, being defined outside any function, is said to be global.

Interestingly, inside a function, one can read the value of such global variables:

def fahr_to_celsius(temp, inverted = False):
    if inverted:
        return ((temp * (9/5)) + 32)
    else:
        return ((temp - 32) * (5/9))

def print_temperatures():
    print('temperature in Fahrenheit was:', temp_fahr)
    print('temperature in Celsius was:', temp_celsius)

temp_fahr = 212.0
temp_celsius = fahr_to_celsius(212.0)

print_temperatures()
temperature in Fahrenheit was: 212.0
temperature in Celsius was: 100.0

It is not recommended to modify the value of a global variable inside a function, as it can lead to unexpected behavior and make the code harder to debug. If you need to modify a global variable, it is better to return the modified value from the function and assign it to the global variable outside of the function.

Application on our data

Now that we know how to wrap bits of code up in functions, we can make our inflammation analysis easier to read and easier to reuse. First, let’s make a visualize() function that generates our plots:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def visualize(filename):
    """Visualizes the average, maximum and minimum inflammation per day for a given file.
    Parameters: 
    filename (str): The path of the file to visualize.
    """ 
    data = pd.read_csv(filename, index_col=0)
    fig = plt.figure()

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.set_xlabel('Days')
    axes1.set_xticks([])
    axes1.plot(data.mean(axis=0))

    axes2.set_ylabel('max')
    axes2.set_xlabel('Days')
    axes2.set_xticks([])
    axes2.plot(data.max(axis=0))

    axes3.set_ylabel('min')
    axes3.plot(data.min(axis=0))
    axes3.set_xlabel('Days')
    axes3.set_xticks([])

    fig.tight_layout()
    plt.show()

visualize("data/inflammation-01.csv")

We did not use the return keyword. In Python, functions are not required to include a return statement and can be used for the sole purpose of grouping together pieces of code that conceptually do one thing. In such cases, function names usually describe what they do, e.g. visualize(). This is a common practice in Python, and it is not considered bad style. In fact, it can make the code more readable and easier to understand.

By giving our function human-readable names, we can more easily read and understand what is happening in the for loop. Even better, if at some later date we want to use either of those pieces of code again, we can do so in a single line.

Moreover, when writing code, it is important to keep in mind that other people (or even yourself in the future) will have to read and understand it. Therefore, it is good practice to write code that is easy to read and understand. This can be achieved by using descriptive variable and function names, adding comments to explain what the code is doing, and organizing the code in a logical way. Both documentation and a programmer’s coding style combine to determine how easy it is for others to read and understand the programmer’s code.

Exercise

ImportantExercise

Create a function that prints a message if the data in a file looks suspicious, using the same criteria as before (suspicious looking maxima and minima that add up to zero). No output needs to be returned, just print the message. It needs to also print the name of the file that is being checked.

For example the output of detect_problems("data/inflammation-01.csv") could be:

Checking file: data/inflammation-01.csv
Suspicious looking maxima!

Use it on one of the inflammation files, and then on all of them using a loop.

import pandas as pd
import glob

def detect_problems(filename):
    """Checks if the data in a file has suspicious looking maxima and minima that add up to zero. 
    No output is returned, just a message is printed. 
    Parameters: 
    filename (str): The path of the file to visualize.
    """ 
    data = pd.read_csv(filename, index_col=0)
    print("Checking file:", filename)

    if data.max(axis=0).iloc[0] == 0 and data.max(axis=0).iloc[20] == 20:
        print('Suspicious looking maxima!')
    elif data.min(axis=0).sum() == 0:
        print('Minima add up to zero!')
    else:
        print('Seems OK!')

detect_problems("data/inflammation-01.csv")
Checking file: data/inflammation-01.csv
Suspicious looking maxima!
filenames = sorted(glob.glob('data/inflammation*.csv'))
for filename in filenames:
    detect_problems(filename)
Checking file: data/inflammation-01.csv
Suspicious looking maxima!
Checking file: data/inflammation-02.csv
Suspicious looking maxima!
Checking file: data/inflammation-03.csv
Minima add up to zero!
Checking file: data/inflammation-04.csv
Suspicious looking maxima!
Checking file: data/inflammation-05.csv
Suspicious looking maxima!
Checking file: data/inflammation-06.csv
Suspicious looking maxima!
Checking file: data/inflammation-07.csv
Suspicious looking maxima!
Checking file: data/inflammation-08.csv
Minima add up to zero!
Checking file: data/inflammation-09.csv
Suspicious looking maxima!
Checking file: data/inflammation-10.csv
Suspicious looking maxima!
Checking file: data/inflammation-11.csv
Minima add up to zero!
Checking file: data/inflammation-12.csv
Suspicious looking maxima!

Command-Line programs

Sooner or later we might want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that in an efficient way, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a dataset and prints the average inflammation per patient.

If you remember from the first lesson, it can be used either interactively in the terminal, or by writing a shell script (a file that contains shell commands) and running it.

Note

In this lesson we are switching from typing commands in a Python interpreter to typing commands in a shell terminal window (such as bash). When you see a $ in front of a command that tells you that you are in the shell, when you see a >>> it tells you are in the Python interpreter.

Creating a python script

For this we need to create a Python script, which is a file that contains Python code. We can create a Python script using any text editor. The file should have a .py extension to indicate that it is a Python script.

Open a new file in your text editor, and write the following code:

#!/usr/bin/env python3

import pandas as pd

def main(filename, min = True, mean = True, max = True):
    data = pd.read_csv(filename, index_col=0)
    stats = {}

    if min:
        stats['min'] = data.min(axis=0).min()
    if mean:
        stats['mean'] = data.mean(axis=0).mean()
    if max:
        stats['max'] = data.max(axis=0).max()

    for key, value in stats.items():
        print(f"{key}: {value}")

Save it under the name inflammation_stats.py, in ~/Desktop/swc-python/code.

Note

F-strings are a way to format strings in Python. They are defined by prefixing a string with the letter f or F, and they allow you to include expressions inside curly braces {} that will be evaluated at runtime and included in the string. They were introduced in Python 3.6 and provide a more concise and readable way to format strings. See the official documentation for more information on hwo to format strings.

You should already be in the ~/Desktop/swc-python/ directory. If not you can navigate to it using the cd command in the terminal. Then, you can run the script using the following command:

cd ~/Desktop/swc-python

You can verify you are in the right directory by running pwd (print working directory):

pwd

You can now run the python script:

./code/inflammation_stats.py 
# or
python code/inflammation_stats.py 

It should not output anything, as we only provided a function but did not call it. We need to call the main() function and provide it with the name of the file we want to analyze. We can do that by adding the following lines at the end of our script:

main("data/inflammation-01.csv")

Now, running ./code/inflammation_stats.py in command line should output the average, minimum and maximum inflammation per patient for the file data/inflammation-01.csv.

User-defined input

There are some interesting ways to get input from the user:

  • input() receives input from the keyboard. This means that the input is defined while the python script is being executed.
  • sys.argv takes arguments provided in command line after the name of the program. This means that the input is defined before the python script is being executed.
  • argparse is similar to sys.argv, with the advantage of being able to give specific names to arguments.

The sys.argv list contains the command-line arguments passed to the script, and sys.argv[1] refers to the first argument (the filename in this case).

input

Python stops executing when it comes to the input() function, and continues when the user has given some input.

In the inflammation_stats.py file , write the following after the end of the main() function:

filename = input("Enter filename: ")
print("Filename is: " + filename)

main(filename)

Then in the terminal, run:

./code/inflammation_stats.py 
# or 
python code/inflammation_stats.py 

You should be asked, in command line, to enter a filename. When you write it (e.g. data/inflammation-01.csv), and press Enter, it should run the main() function with the given file.

Enter filename: data/inflammation-01.csv
Filename is: data/inflammation-01.csv
min: 0
mean: 6.14875
max: 20

sys.argv

To use sys.argv you need to import a module called sys. It is part of the standard python library, so you should not have to install anything in particular.

Modify the same python script, to add at the beginning of the script:

import sys 

and substitute the end of the script with:

print("Filename is: " + sys.argv[1])

main(sys.argv[1])

Then in the terminal, run:

./code/inflammation_stats.py data/inflammation-01.csv
# or 
python code/inflammation_stats.py data/inflammation-01.csv

Arguments are given in command line, separated by [space].

Filename is: data/inflammation-01.csv
min: 0
mean: 6.14875
max: 20
Note

What is the type of sys.argv? Remember that in python index begins at 0. What do you think is sys.argv[0]? Verify!

Also, what happens if you run ./code/inflammation_stats.py data/inflammation-01.csv data/inflammation-02.csv?

argparse

Just like for sys, you need to import argparse.

Modify the same python script, to add at the beginning of the script:

import argparse 

and substitute the end of the script with:

parser = argparse.ArgumentParser()
parser.add_argument('--filename', action="store", required=True)
args = parser.parse_args()
print("Filename is: " + args.filename)

main(args.filename)

Then in the terminal, run:

./code/inflammation_stats.py --filename data/inflammation-01.csv

Arguments are given in command line, but they have specific names.

Note

argparse is a very useful module when creating programs! You can easily specify the expected type of argument, whether it is optional or not, and create a help for your script. Check their tutorial for more information.

Exercise

ImportantExercise

Modify the script to also take --min, --mean and --max as command line arguments. If the argument is given, the corresponding statistic will be calculated and printed. By default, it should not calculate any of the statistics, the user has to specify that in command line.

If the user runs ./code/inflammation_stats_argparse.py --filename data/inflammation-01.csv --min --max, it should only calculate and print the minimum and maximum inflammation per patient, but not the mean.

You can use the action="store_true" option in add_argument() to create a boolean variable for each statistic.
For example:

parser.add_argument('--min', action="store_true")

This line will create a variable called args.min, that will be set to False by default. When the user provides the --min argument in command line, args.min will be set to True.

You should successfully run your script with the following commands:

./code/inflammation_stats.py --filename data/inflammation-01.csv --min
./code/inflammation_stats.py --filename data/inflammation-01.csv --max
./code/inflammation_stats.py --filename data/inflammation-01.csv --min --max
./code/inflammation_stats.py --filename data/inflammation-01.csv --min --max --mean
#!/usr/bin/env python3

import argparse
import pandas as pd

def main(filename, min = True, mean = True, max = True):
    data = pd.read_csv(filename, index_col=0)
    stats = {}

    if min:
        stats['min'] = data.min(axis=0).min()
    if mean:
        stats['mean'] = data.mean(axis=0).mean()
    if max:
        stats['max'] = data.max(axis=0).max()

    for key, value in stats.items():
        print(f"{key}: {value}")

# Deal with arguments
parser = argparse.ArgumentParser()
parser.add_argument('--filename', action="store", required=True)
parser.add_argument('--min', action="store_true")
parser.add_argument('--max', action="store_true")
parser.add_argument('--mean', action="store_true")
args = parser.parse_args()
print("Filename is: " + args.filename)

main(args.filename, min=args.min, mean=args.mean, max=args.max)

Then in the command line, run:

./code/inflammation_stats.py --filename data/inflammation-01.csv --min
./code/inflammation_stats.py --filename data/inflammation-01.csv --max
./code/inflammation_stats.py --filename data/inflammation-01.csv --min --max
./code/inflammation_stats.py --filename data/inflammation-01.csv --min --max --mean

Handle multiple filenames

The next step is to teach our program how to handle multiple files.

We want our program to process each file separately, so we need a loop that executes once for each filename. If we specify the files on the command line, the filenames will be in args.filename. Fortunately, argparse has a built-in way to handle an unknown number of arguments. We can use the nargs='+' option in add_argument() to specify that we want to accept one or more filenames as input. This will create a list of filenames in args.filename, which we can then loop through.

Here’s how it can be done:

#!/usr/bin/env python3

import argparse
import pandas as pd

def main(filename, min = True, mean = True, max = True):
    for file in filename:
        print("Processing file:", file)
        data = pd.read_csv(file, index_col=0)
        stats = {}

        if min:
            stats['min'] = data.min(axis=0).min()
        if mean:
            stats['mean'] = data.mean(axis=0).mean()
        if max:
            stats['max'] = data.max(axis=0).max()

        for key, value in stats.items():
            print(f"{key}: {value}")

# Deal with arguments
parser = argparse.ArgumentParser()
parser.add_argument('--filename', nargs='+', action="store", required=True)
parser.add_argument('--min', action="store_true")
parser.add_argument('--max', action="store_true")
parser.add_argument('--mean', action="store_true")
args = parser.parse_args()

main(args.filename, min=args.min, mean=args.mean, max=args.max)

Then in command line, you can run:

./code/inflammation_stats.py --filename data/inflammation-01.csv --min --max
./code/inflammation_stats.py --filename data/inflammation-01.csv data/inflammation-02.csv --min --max
./code/inflammation_stats.py --filename data/inflammation-0*.csv --min --max
Note

What is the type of args.filename when using nargs='+'? What does it contain?

Warning

Documentation is very important as a user does not know what the expected input is, and what the output will be. It is also important to specify how the program should be run, and to give examples of usage. This is especially important when creating command-line programs, as users might not be familiar with how to use them.

Testing the code

When coding, a good practice is to test your code on small data files before running it on large datasets. This allows you to check that your code is working as expected and to catch any errors or bugs before they cause problems with larger datasets.

Indeed, your code might not have any syntax errors, but it might not be doing what you think it is doing. For example, you might think that your code is calculating the mean of a dataset, but in reality, it is calculating the sum. If you run your code on a large dataset, you might not notice this error until it is too late.

Here we can test our code on a smaller file that has the same structure as the larger files, but only contains a few lines of data. This way, we can easily check that our code is calculating the mean correctly for each line, and that it is doing what we expect it to do.

The file data/small-01.csv contains the following data:

Day1,Day2,Day3
Patient1,0,0,1
Patient2,0,1,2

We can run:

./code/inflammation_stats.py --filename data/small-01.csv --min --max --mean

There are two more files small-02.csv and small-03.csv that you can use to test your code.

It is also important to test your code on edge cases, which are cases that are at the extreme ends of the input space. For example, if your code is supposed to handle missing values, you should test it on a file that contains missing values to make sure that it is handling them correctly.

Even if your documentation is very clear, the user might not behave as you expect them to, and might provide input that is not in the format you expected. For example, if your code is supposed to take a filename as input, the user might provide a filename that does not exist, or a filename that is in a different format than what you expected.

It is important to test your code on such edge cases to make sure that it is handling them correctly and providing useful error messages to the user. But how can we provide useful error messages to the user? This is what we will see in the next section.

Errors and Exceptions

Every programmer encounters errors, both those who are just beginning, and those who have been programming for years. Encountering errors and exceptions can be very frustrating at times, and can make coding feel like a hopeless endeavour. However, understanding what the different types of errors are and when you are likely to encounter them can help a lot. Once you know why you get certain types of errors, they become much easier to fix.

Tracebacks

Errors in Python have a very specific form, called a traceback. Let’s examine one by providing a filename that does not exist:

./code/inflammation_stats.py --filename data/small-06.csv --min --max --mean
Processing file: data/small-06.csv
Traceback (most recent call last):
  File "/Volumes/projects/PCa_SingleCell/utils/workshop/python-intro/./code/inflammation_stats.py", line 31, in <module>
    main(args.filename, min=args.min, mean=args.mean, max=args.max)
  File "/Volumes/projects/PCa_SingleCell/utils/workshop/python-intro/./code/inflammation_stats.py", line 9, in main
    data = pd.read_csv(file, index_col=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gilbartv/miniconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gilbartv/miniconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gilbartv/miniconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gilbartv/miniconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/Users/gilbartv/miniconda3/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'data/small-06.csv'

Python tells us the category or type of error (in this case, it is an FileNotFoundError) and a more detailed error message (in this case, it says [Errno 2] No such file or directory: 'data/small-06.csv').

This particular traceback has many levels. The traceback also shows us the exact line of code where the error occurred, and the sequence of function calls that led to that line of code being executed.

The lines in our code that caused the error are main(args.filename, min=args.min, mean=args.mean, max=args.max), and then pd.read_csv(file, index_col=0). Indeed main() is the function that we called in command line, and inside main(), the error occurred when trying to read the file with pd.read_csv(). It gives more or the subsequent calls that were made inside the pandas library to get to the line of code that caused the error, but we can ignore those for now.

Note

The length of the error message does not reflect severity, rather, it indicates that your program called many subsequent functions before it encountered the error.

Tip

One reason for receiving this FileNotFoundError error is that you specified an incorrect path to the file. For example, if I am currently in a folder called myproject, and I have a file in myproject/writing/myfile.txt, but I try to open myfile.txt, this will fail. The correct path would be writing/myfile.txt. It is also possible that the file name or its path contains a typo.

If you encounter an error and don’t know what it means, it is still important to read the traceback closely. That way, if you fix the error, but encounter a new one, you can tell that the error changed. Additionally, sometimes knowing where the error occurred is enough to fix it, even if you don’t entirely understand the message.

Type of Exceptions

If you do encounter an error you don’t recognize, try looking at the official documentation on errors also called Exceptions in Python. However, note that you may not always be able to find the error there, as it is possible to create custom errors. In that case, hopefully the custom error message is informative enough to help you figure out what went wrong.

Here is a table of some of the built-in exceptions in python.

Exception Description
IndexError Raised when the index of a sequence is out of range.
KeyError Raised when a key is not found in a dictionary.
KeyboardInterrupt Raised when the user hits the interrupt key (Ctrl+c or Delete).
NameError Raised when a variable is not found in the local or global scope.
TypeError Raised when a function or operation is applied to an object of an incorrect type.
StatementError Raised when statements are in an order or contain characters not expected by the programming language.
ValueError Raised when a function receives an argument of the correct type but of an incorrect value.
RuntimeError Raised when an error occurs that do not belong to any specific exceptions.
Exception Base class of exceptions.

Exercise

ImportantExercise

Try the following codes and understand what the error is:

# Code A
def print_message(day):
    messages = [
        'Hello, world!',
        'Today is Tuesday!',
        'It is the middle of the week.',
        'Today is Donnerstag in German!',
        'Last day of the week!',
        'Hooray for the weekend!',
        'Aw, the weekend is almost over.'
    ]
    print(messages[day])

def print_sunday_message():
    print_message(7)

print_sunday_message()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[73], line 17
     14 def print_sunday_message():
     15     print_message(7)
---> 17 print_sunday_message()

Cell In[73], line 15, in print_sunday_message()
     14 def print_sunday_message():
---> 15     print_message(7)

Cell In[73], line 12, in print_message(day)
      2 def print_message(day):
      3     messages = [
      4         'Hello, world!',
      5         'Today is Tuesday!',
   (...)
     10         'Aw, the weekend is almost over.'
     11     ]
---> 12     print(messages[day])

IndexError: list index out of range
# Code B
def some_function()
    msg = 'hello, world!'
    print(msg)
     return msg
  Cell In[74], line 2
    def some_function()
                       ^
SyntaxError: expected ':'
# Code C
def some_function():
    msg = 'hello, world!'
    print(msg)
        return msg
  Cell In[75], line 5
    return msg
    ^
TabError: inconsistent use of tabs and spaces in indentation
# Code D
print(a)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[76], line 2
      1 # Code D
----> 2 print(a)

NameError: name 'a' is not defined
# Code E
for number in range(10):
    total = total + number
print('The total is:', total)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[77], line 3
      1 # Code E
      2 for number in range(10):
----> 3     total = total + number
      4 print('The total is:', total)

NameError: name 'total' is not defined
# Code F
Count = 0
for number in range(10):
    count = count + number
print('The count is:', count)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[78], line 4
      2 Count = 0
      3 for number in range(10):
----> 4     count = count + number
      5 print('The count is:', count)

NameError: name 'count' is not defined

Code A raises an IndexError because it is trying to access the 8th element of the messages list, which only has 7 elements (indexed from 0 to 6), meaning we tried to access a list index that did not exist.

Code B raises a SyntaxError because there is a missing colon at the end of the function definition line (def some_function()). Additionally, there is an indentation error in the return statement, which should be indented to the same level as the print statement, raising an IndentationError.

Code C raises a TabError because there is a mix of tabs and spaces in the indentation. In Python, it is important to use either tabs or spaces consistently for indentation, but not both. This can be fixed by replacing all tabs with spaces or vice versa.

Code D raises a NameError because the variable a is not defined anywhere in the code. To fix this, you would need to define a before trying to print it, for example by assigning it a value like a = 10. This can happend when you have a typo in the variable name, or when you forget to define a variable before using it, or when you want to print a string but forget to put the quotes around it. It is important to check that all variables are defined and spelled correctly before using them in your code.

Code E raises a NameError because the variable total is not defined before it is used in the loop. To fix this, you would need to initialize total to a value (e.g., total = 0) before the loop starts.

Code F raises a NameError because the variable count is not defined before it is used in the loop. To fix this, you would need to initialize count to a value (e.g., count = 0) before the loop starts. There is a case sensitivity issue with the variable name Count, which should be changed to count to match the variable used in the loop and print statement.

Defensive Programming

Exceptions Handling

It is possible to handle errors (in python, they are also called exceptions), using the raise keyword followed by an exception type and an error message. This allows you to create custom error messages that are more informative than the default error messages provided by Python:

x = "hello"
if not isinstance(x, int):
    raise TypeError("Only integers are allowed") 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[79], line 3
      1 x = "hello"
      2 if not isinstance(x, int):
----> 3     raise TypeError("Only integers are allowed") 

TypeError: Only integers are allowed
x = 12
if not isinstance(x, int):
    raise TypeError("Only integers are allowed") 

If an error is raised, the code will crash and stop executing. However, sometimes we want to handle the error and continue executing the code. It is also possible to use the following statements to handle exceptions in a more flexible way:

  • try to test a block of code for errors
  • except to handle the error
  • else to excute code if there is no error
  • finally to excute code, regardless of the result of the try and except blocks
# The try block will generate an exception, because some_undefined_variable is not defined:
try:
  print(some_undefined_variable)
except:
  print("Oops... Something went wrong") 
Oops... Something went wrong
# Without the try block, the program will crash and raise an error:
print(some_undefined_variable)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[82], line 2
      1 # Without the try block, the program will crash and raise an error:
----> 2 print(some_undefined_variable)

NameError: name 'some_undefined_variable' is not defined
try:
  print(some_undefined_variable)
except:
  print("Oops... Something went wrong") 
else:
  print("Nothing went wrong") 
finally:
  print("The 'try except' is finished") 
Oops... Something went wrong
The 'try except' is finished

You can use the Exceptions types to be more specific about the type of exception occurring.

try:
  print(some_undefined_variable)
except NameError:
  print("A variable is not defined") 
except:
  print("Oops... Something went wrong") 
else:
  print("Nothing went wrong") 
finally:
  print("The 'try except' is finished") 
A variable is not defined
The 'try except' is finished

You can also use the Exceptions types to throw an exception if a condition occurs, combining it with the raise keyword.

x = "hello"
try:
  if not isinstance(x, int):
    raise TypeError("Only integers are allowed") 
  if x < 0:
    raise ValueError("Sorry, no numbers below zero") 
  print(x, "is a positive integer.")
except NameError:
  print("A variable is not defined") 
else:
  print("Nothing went wrong") 
finally:
  print("The 'try except' is finished") 
The 'try except' is finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[85], line 4
      2 try:
      3   if not isinstance(x, int):
----> 4     raise TypeError("Only integers are allowed") 
      5   if x < 0:
      6     raise ValueError("Sorry, no numbers below zero") 

TypeError: Only integers are allowed
x = -1
try:
  if not isinstance(x, int):
    raise TypeError("Only integers are allowed") 
  if x < 0:
    raise ValueError("Sorry, no numbers below zero") 
  print(x, "is a positive integer.")
except NameError:
  print("A variable is not defined") 
else:
  print("Nothing went wrong") 
finally:
  print("The 'try except' is finished") 
The 'try except' is finished
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[86], line 6
      4     raise TypeError("Only integers are allowed") 
      5   if x < 0:
----> 6     raise ValueError("Sorry, no numbers below zero") 
      7   print(x, "is a positive integer.")
      8 except NameError:

ValueError: Sorry, no numbers below zero
x = 1
try:
  if not isinstance(x, int):
    raise TypeError("Only integers are allowed") 
  if x < 0:
    raise ValueError("Sorry, no numbers below zero") 
  print(x, "is a positive integer.")
except NameError:
  print("A variable is not defined") 
else:
  print("Nothing went wrong") 
finally:
  print("The 'try except' is finished") 
1 is a positive integer.
Nothing went wrong
The 'try except' is finished

Exercise

ImportantExercise

Let’s make our previous function even better by adding some exception handling. Raise a TypeError if the input filename is not a string. To verify that a variable is of a certain type, you can use the isinstance() function. For example, isinstance(x, str) will return True if x is a string, and False otherwise.

With the input given below, the output and errors should be:

main(filename = 5474)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[89], line 1
----> 1 main(filename = 5474)

Cell In[88], line 5, in main(filename, min, mean, max)
      3 def main(filename, min = True, mean = True, max = True):
      4     if not isinstance(filename, str):
----> 5         raise TypeError(f"Filename must be a string, {filename} is of type {type(filename)}")
      6     print("Processing file:", filename)
      7     data = pd.read_csv(filename, index_col=0)

TypeError: Filename must be a string, 5474 is of type <class 'int'>
main(filename = "data/inflammation-01.csv")
Processing file: data/inflammation-01.csv
min: 0
mean: 6.14875
max: 20

The function to modify is:

import pandas as pd

def main(filename, min = True, mean = True, max = True):
    print("Processing file:", filename)
    data = pd.read_csv(filename, index_col=0)
    stats = {}
    if min:
        stats['min'] = data.min(axis=0).min()
    if mean:
        stats['mean'] = data.mean(axis=0).mean()
    if max:
        stats['max'] = data.max(axis=0).max()
    for key, value in stats.items():
        print(f"{key}: {value}")
import pandas as pd

def main(filename, min = True, mean = True, max = True):
    if not isinstance(filename, str):
        raise TypeError(f"Filename must be a string, {filename} is of type {type(filename)}")
    print("Processing file:", filename)
    data = pd.read_csv(filename, index_col=0)
    stats = {}
    if min:
        stats['min'] = data.min(axis=0).min()
    if mean:
        stats['mean'] = data.mean(axis=0).mean()
    if max:
        stats['max'] = data.max(axis=0).max()
    for key, value in stats.items():
        print(f"{key}: {value}")


main(filename = 5474)
main(filename = "data/inflammation-01.csv")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[91], line 19
     15     for key, value in stats.items():
     16         print(f"{key}: {value}")
---> 19 main(filename = 5474)
     20 main(filename = "data/inflammation-01.csv")

Cell In[91], line 5, in main(filename, min, mean, max)
      3 def main(filename, min = True, mean = True, max = True):
      4     if not isinstance(filename, str):
----> 5         raise TypeError(f"Filename must be a string, {filename} is of type {type(filename)}")
      6     print("Processing file:", filename)
      7     data = pd.read_csv(filename, index_col=0)

TypeError: Filename must be a string, 5474 is of type <class 'int'>

Debugging

You will have to learn how to debug your code at some point, and it is a very important skill to have. Debugging is the process of finding and fixing errors in your code. It can be very frustrating at times, but it is also very rewarding when you finally find the bug and fix it.

Here are some tips on how to debug:

  • Find the line that raises the error.
  • Understand the error message. Read the error message carefully and try to understand what it is telling you. The error message will often give you a clue about what went wrong and where to look for the bug.
  • Read the code carefully. Try to understand what the code is doing and how it is supposed to work, what each variable is supposed to contain, and what the expected output is.
  • Print the values of variables. If you are not sure what a variable is supposed to contain, print it out and check its value. This can help you understand what is going wrong and where the bug is.
  • Comment out code. If we have a long program and we don’t know where the problem is, we can comment out parts of the code and run it to see if the problem goes away. This can help us narrow down where the problem is.
  • Explain the code to someone else. Sometimes, explaining the code to someone else can help us understand it better and spot the bug. If you don’t have someone nearby to share your problem description with, get a rubber duck! The idea is that by explaining the code to the duck, you will be forced to think about it more deeply and spot the bug.
  • Take a break, and come back to the problem with fresh eyes. This is especially useful when we have been staring at the same code for a long time, and we can’t figure out what’s wrong. Sometimes, taking a break and coming back to the problem later can help us see things that we missed before.

When coding, to avoid debugging here are some good practices to follow: - Test with simplified data. Before doing statistics on a real data set, we should try calculating statistics for a single record, for two identical records, for two records whose values are one step apart, or for some other case where we can calculate the right answer by hand. - Run your code often. If you write a long program and then run it, you might find that it doesn’t work, but you won’t know where the problem is. If you run your code often, you can catch errors early and fix them before they become too difficult to find. - Test a simplified case. If our program is supposed to simulate the effects of climate change on speciation, our first test should hold temperature, precipitation, and other factors constant. Then, we should introduce one factor at a time, and check that the results are consistent with our expectations. If we can’t predict what should happen when we change a single factor, it’s unlikely that we’ll be able to figure out what’s going wrong when we change many factors at once. - Change one thing at a time. If we change many things at once, it’s hard to know which change caused the problem. If we change one thing at a time, we can quickly figure out what’s causing the problem. - Compare to an oracle. A test oracle is something whose results are trusted, such as experimental data, an older program, or a human expert. We use test oracles to determine if our new program produces the correct results. If we have a test oracle, we should store its output for particular cases so that we can compare it with our new results as often as we like without re-running that program. - Check conservation laws. If we are analyzing patient data, the number of records should either stay the same or decrease as we move from one analysis to the next (since we might throw away outliers or records with missing values). If “new” patients start appearing out of nowhere as we move through our pipeline, it’s probably a sign that something is wrong. - Visualize. Data analysts frequently use simple visualizations to check both the science they’re doing and the correctness of their code.

If we train ourselves to avoid making some kinds of mistakes, to break our code into modular, testable chunks, and to turn every assumption (or mistake) into an assertion, it will actually take us less time to produce working programs, not more.

Final tips and resources

Here are a couple of tips:

  • Leave comments (think of your future self)
  • Be consistent (quotes, indents…)
  • Break down one complex task into lots of (easy) small tasks
  • When using functions you are not comfortable with, verify the output and make sure it does what you expect in with small examples
  • Don’t re-invent the wheel, for common tasks, it’s likely that a function already exists
  • Read the documentation when using a new package or function
  • Google It! Use the correct programming vocabulary to increase your chances of finding an answer. If you don’t find anything, try wording it differently.
  • Prompt it to AI! It works generally well to explain a code, and for small tasks using famous packages.
  • The easiest way to learn is by example, so follow a tutorial with the example data, and then try to apply it to your own

You can follow some free tutorials on:

References

Here are some references and ressources that greatly inspired this class :