At the end of this class, you will be able to:
Remembering some of lesson 1.
A function stores a piece of code that performs a certain task, and that gets run when called. It takes some data as input (parameters that are required or optional), and returns an output (that can be of any type).
We already learned how to run a predefined function in the last lesson. You need to write its name followed by parentheses. Parameters are added inside the parentheses as follows:
# round(number, ndigits=None)
x = round(number = 5.76543, ndigits = 2)
print(x)
5.77
To get more information about a function, use the `help()` function.
We will now learn how to create our own function.
In Python, a function is declared with the keyword `def` followed by its name, and the arguments inside parentheses. The next block of code, corresponding to the content of the function, must be indented. The output is defined by the `return` keyword.
def hello(name):
    """Presenting myself."""
    presentation = "Hello, my name is {0}.".format(name)
    return presentation

text = hello(name = "Valentine")
print(text)
Hello, my name is Valentine.
As you may have noticed, you can also add a description of the function directly after the function definition. It is the message that will be shown when running `help()`. As it can be a long text over multiple lines, it is common to put it inside triple quotes `"""`.
help(hello)
Help on function hello in module __main__:
hello(name)
Presenting myself.
You can have several arguments. They can be mandatory or optional. To make them optional, they need to have a default value assigned inside the function definition, like so:
def hello(name, french = True):
    """Presenting myself."""
    if french:
        presentation = "Bonjour, je m'appelle {0}."
    else:
        presentation = "Hello, my name is {0}."
    return presentation.format(name)

The parameter `name` is mandatory, but `french` is optional.
hello("Valentine")
"Bonjour, je m'appelle Valentine."
hello(french = False)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 hello(french = False)

TypeError: hello() missing 1 required positional argument: 'name'
Reminder: if you provide the parameters in the exact same order as they are defined, you don’t have to name them. If you name the parameters you can switch their order. As good practice, put all required parameters first.
hello(french = False, name = "Valentine")
'Hello, my name is Valentine.'
hello("Valentine", False)
'Hello, my name is Valentine.'
If no `return` statement is given, the function still runs, but it returns `None`.
def hello(name):
    """Presenting myself."""
    print("We are inside the 'hello()' function.")
    presentation = "Hello, my name is {0}.".format(name)

print(hello("Valentine"))
We are inside the 'hello()' function.
None
The output can be of any type. If you have a lot of things to return, you might want to return a `list` or a `dict`, for example.
def multiple_of_3(list_of_numbers):
    """Returns the numbers that are multiples of 3."""
    multiples = []
    for num in list_of_numbers:
        if num % 3 == 0:
            multiples.append(num)
    return multiples

multiple_of_3(range(1, 20, 2))
[3, 9, 15]
This could be written as a one-liner.
def multiple_of_3(list_of_numbers):
    """Returns the numbers that are multiples of 3."""
    multiples = [num for num in list_of_numbers if num % 3 == 0]
    return multiples

multiple_of_3(range(1, 20, 2))
[3, 9, 15]
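When the returned values have distinct meanings, a `dict` lets the caller access each of them by name. The following is only a minimal sketch: the function name `summarize` and its keys are made up for this illustration.

```python
def summarize(list_of_numbers):
    """Returns the smallest value, the largest value and the count of a list of numbers."""
    # Each result is stored under a descriptive key (hypothetical names)
    summary = {
        "min": min(list_of_numbers),
        "max": max(list_of_numbers),
        "count": len(list_of_numbers),
    }
    return summary

result = summarize([4, 8, 15, 16, 23, 42])
print(result["min"], result["max"], result["count"])
```

The caller does not need to remember the order of the returned values, which is the main advantage over returning a list.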
Write a function called `nucl_freq` to compute the nucleotide frequency of a sequence. Given a sequence as input, it outputs a dictionary whose keys are the nucleotides `A`, `T`, `C` and `G`, and whose values are their frequencies in the sequence.
With the input given below, the output should be:
def ...
# Your code here
nucl_freq("ATTCCCGGGG")
{'C': 0.3, 'T': 0.2, 'A': 0.1, 'G': 0.4}
It is possible to handle errors (in Python, they are also called exceptions), using the following statements:
- `try` to test a block of code for errors
- `except` to handle the error
- `else` to execute code if there is no error
- `finally` to execute code, regardless of the result of the try and except blocks

# The try block will generate an exception, because some_undefined_variable is not defined:
try:
    print(some_undefined_variable)
except:
    print("Oops... Something went wrong")
Oops... Something went wrong
# Without the try block, the program will crash and raise an error:
print(some_undefined_variable)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 2
      1 # Without the try block, the program will crash and raise an error:
----> 2 print(some_undefined_variable)

NameError: name 'some_undefined_variable' is not defined
try:
    print(some_undefined_variable)
except:
    print("Oops... Something went wrong")
else:
    print("Nothing went wrong")
finally:
    print("The 'try except' is finished")
Oops... Something went wrong
The 'try except' is finished
Here is a table of some of the built-in exceptions in Python.
Exception | Description |
---|---|
IndexError | Raised when the index of a sequence is out of range. |
KeyError | Raised when a key is not found in a dictionary. |
KeyboardInterrupt | Raised when the user hits the interrupt key (Ctrl+c or Delete). |
NameError | Raised when a variable is not found in the local or global scope. |
TypeError | Raised when a function or operation is applied to an object of an incorrect type. |
ValueError | Raised when a function receives an argument of the correct type but of an incorrect value. |
RuntimeError | Raised when an error occurs that does not belong to any more specific exception. |
Exception | Base class of exceptions. |
You can use them to be more specific about the type of exception occurring.
try:
    print(some_undefined_variable)
except NameError:
    print("A variable is not defined")
except:
    print("Oops... Something went wrong")
else:
    print("Nothing went wrong")
finally:
    print("The 'try except' is finished")
A variable is not defined
The 'try except' is finished
You can also use them to throw an exception if a condition occurs, by using the `raise` keyword.
x = "hello"
try:
    if not isinstance(x, int):
        raise TypeError("Only integers are allowed")
    if x < 0:
        raise ValueError("Sorry, no numbers below zero")
    print(x, "is a positive integer.")
except NameError:
    print("A variable is not defined")
else:
    print("Nothing went wrong")
finally:
    print("The 'try except' is finished")
The 'try except' is finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[20], line 4
      2 try:
      3     if not isinstance(x, int):
----> 4         raise TypeError("Only integers are allowed")
      5     if x < 0:
      6         raise ValueError("Sorry, no numbers below zero")

TypeError: Only integers are allowed
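In the example above, the `TypeError` stops the program because only `NameError` is handled. As a sketch of one way to handle the raised exceptions instead, the same checks can be placed in a function (the name `check_positive_integer` is hypothetical) and caught with `except ... as`, which gives access to the error message:

```python
def check_positive_integer(x):
    """Raises an exception if x is not an integer, or if it is below zero."""
    if not isinstance(x, int):
        raise TypeError("Only integers are allowed")
    if x < 0:
        raise ValueError("Sorry, no numbers below zero")
    return x

messages = []
for value in ["hello", -3, 7]:
    try:
        check_positive_integer(value)
    except (TypeError, ValueError) as err:
        # 'err' is the exception object; str(err) is its message
        messages.append(str(err))
    else:
        messages.append("{0} is valid".format(value))

print(messages)
```

Listing several exception types in a tuple lets one `except` block handle all of them; any other exception type would still stop the program.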
Let’s make our previous function even better by adding some exception handling. Raise a `TypeError` if the input is not a string. Raise a `ValueError` if the input string contains anything other than the nucleotides `A, C, T, G`.
With the input given below, the output and errors should be:
def ...
# Your code here
nucl_freq(5474)
nucl_freq("ATTCXCCGGGG")
nucl_freq("ATTCCCGGGG")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 15
     12         freq[nucl] = seq.count(nucl)/n
     13     return freq
---> 15 nucl_freq(5474)
     16 nucl_freq("ATTCXCCGGGG")
     17 nucl_freq("ATTCCCGGGG")

Cell In[21], line 3, in nucl_freq(seq)
      1 def nucl_freq(seq):
      2     if not isinstance(seq, str):
----> 3         raise TypeError("Input must be a string.")
      4     valid_nucl = {"A", "T", "C", "G"}
      5     seq_nucl = set(seq)

TypeError: Input must be a string.
There are some interesting ways to get input from the user:
- `input()` receives input from the keyboard. This means that the input is defined while the Python script is being executed.
- `sys.argv` takes arguments provided on the command line after the name of the program. This means that the input is defined before the Python script is executed.
- `argparse` is similar to `sys.argv`, with the advantage of being able to give specific names to arguments.

Python stops executing when it comes to the `input()` function, and continues when the user has given some input.
In a file called `username-1.py`, write the following:
username = input("Enter username: ")
print("Username is: " + username)
Then in the terminal, run:
python username-1.py
You should be asked, in command line, to enter a username. When you write it, and press Enter, it gets printed.
Enter username: vgilbart
Username is: vgilbart
To use `sys.argv` you need to import a module called `sys`. It is part of the standard Python library, so you should not have to install anything in particular.
In a file called `username-2.py`, write the following:
import sys
print("Username is: " + sys.argv[1])
Then in the terminal, run:
python username-2.py vgilbart
Arguments are given on the command line, separated by spaces.
Username is: vgilbart
What is the type of `sys.argv`? Remember that in Python, indexing begins at `0`. What do you think `sys.argv[0]` is? Verify!
Also, what happens if you run `python username-2.py valentine gilbart`?
Just like for `sys`, you need to import `argparse`.
In a file called `username-3.py`, write the following:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--username', action="store")

args = parser.parse_args()
print("Username is: " + args.username)
Then in the terminal, run:
python username-3.py --username vgilbart
Arguments are given on the command line, but they have specific names.
`argparse` is a very useful module when creating programs! You can easily specify the expected type of argument, whether it is optional or not, and create a help for your script. Check their tutorial for more information.
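As an illustration of these options, here is a minimal sketch that declares a required argument, an optional integer argument with a default value, and help messages. The `--repeat` option and the description text are assumptions added for this example; they are not part of `username-3.py`.

```python
import argparse

parser = argparse.ArgumentParser(description="Greets a user.")
# Required argument with a help message
parser.add_argument("--username", required=True, help="name of the user")
# Optional argument, converted to an integer, with a default value
parser.add_argument("--repeat", type=int, default=1, help="number of greetings")

# A list is passed here so that the example runs without a terminal;
# in a real script, call parser.parse_args() with no argument.
args = parser.parse_args(["--username", "vgilbart", "--repeat", "2"])
for _ in range(args.repeat):
    print("Hello " + args.username)
```

In a script that calls `parser.parse_args()` without arguments, running it with `--help` prints the help that `argparse` generates from these declarations.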
The key function to work with files is `open()`. Its two main parameters are `file` and `mode`.
# Write the correct path for you!
fasta_file = 'exercise/data/example.fasta'
f = open(fasta_file, mode = 'r')
The modes can be one of the following:
Mode | Description |
---|---|
r | Opens a file for reading, error if the file does not exist (default) |
a | Opens a file for appending, creates the file if it does not exist |
w | Opens a file for writing, creates the file if it does not exist, erases its content if it does |
x | Creates the specified file, returns an error if the file exists |
The `open()` function returns a file object, which has a `read()` method for reading the content of the file:
print(f.read())
>seq1
TTAGCTAAATAGCTAGCAAACTAGCTAGCTAAAAAAAAAACTAGCTAGCT
>seq2
ATGCCAGCCAGCCAGCCAGCCAGCTCGCTCGCTCGCCAGCCAGCTAGCTA
>seq3
CCGGGCGGTCGATGGATGGAGGGAGCGAGCGATCGATCGGTCGATCGGTG
>seq4
GATCGATCGATCTTTTTATCGATCGATTGTTCTTTCGATCGTTCTATCGA
>seq5
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAT
The parameter `size` can be added to specify the number of characters to return (bytes, if the file is opened in binary mode).
# We need to re-open it because we have already read the whole file
f = open(fasta_file, mode = 'r')
print(f.read(2))
>s
You can return one line by using the `.readline()` method. By calling it twice, you can read the first two lines:
f = open(fasta_file, mode = 'r')
print(f.readline())
print(f.readline())
>seq1
TTAGCTAAATAGCTAGCAAACTAGCTAGCTAAAAAAAAAACTAGCTAGCT
By looping through the lines of the file, you can read the whole file, line by line:
for i, line in enumerate(f):
    print(i, line)
0 >seq2
1 ATGCCAGCCAGCCAGCCAGCCAGCTCGCTCGCTCGCCAGCCAGCTAGCTA
2 >seq3
3 CCGGGCGGTCGATGGATGGAGGGAGCGAGCGATCGATCGGTCGATCGGTG
4 >seq4
5 GATCGATCGATCTTTTTATCGATCGATTGTTCTTTCGATCGTTCTATCGA
6 >seq5
7 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAT
It is a good practice to close the file when you are done with it.
f.close()
In some cases, changes made to a file may not show until you close the file.
A common syntax to handle files that you might encounter is:
with open(fasta_file, 'r') as f:
    print(f.readline())
>seq1
This code is equivalent to
f = open(fasta_file, 'r')
try:
    print(f.readline())
finally:
    f.close()
>seq1
The `with` statement relies on a context manager, i.e. an object that allocates and releases resources precisely, by cleaning them up once they are no longer needed.
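To see that the file is indeed closed when the `with` block ends, here is a small self-contained sketch. It writes a tiny file into a temporary directory first so that it does not depend on `example.fasta`; the file name `demo.fasta` and its content are made up for the example.

```python
import os
import tempfile

# Work in a temporary directory so that no existing file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.fasta")

    with open(path, "w") as f:
        f.write(">seq1\nATGC\n")
    # Leaving the 'with' block has closed the file
    closed_after_write = f.closed

    with open(path, "r") as f:
        # strip() removes the newline at the end of each line
        lines = [line.strip() for line in f]

print(closed_after_write)
print(lines)
```

`tempfile.TemporaryDirectory()` is itself used with `with`: the directory and its content are deleted when the block ends.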
To write into a file, you must open it in `w` or `a` mode.
Then, the method `write()` can be used.
txt_file = "exercise/data/some_file.txt"
f = open(txt_file, "w")
f.write("Woops! I have deleted the content!\n")

f.close()
# Read the current content of the file
f = open(txt_file, "r")
print(f.read())
Woops! I have deleted the content!
Be very careful when opening a file in `write` mode as you can delete its content without any way to retrieve the original file!
As you may have noticed, write()
returns the number of characters written. You can prevent it from being printed by assigning the return value to a variable that will not be used.
f = open(txt_file, "a")
_ = f.write("Now the file has more content!\n")

f.close()
# Read the current content of the file
f = open(txt_file, "r")
print(f.read())
Woops! I have deleted the content!
Now the file has more content!
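You must specify a newline with the character:
- `\n` in Linux/MacOS
- `\r\n` in Windows
- `\r` in MacOS before X

When a file is opened in text mode (the default), Python converts `\n` to the convention of the operating system on writing, so `\n` is enough in your code.

Python has a built-in package called `os`, to interact with the operating system.
import os
print("Current working directory:", os.getcwd())
os.chdir('../')
print("Current working directory:", os.getcwd())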
Current working directory: /home/runner/work/python-intro/python-intro
Current working directory: /home/runner/work/python-intro
Here are some useful functions from the `os` package.
Function | Description |
---|---|
getcwd() | Returns the current working directory |
chdir() | Changes the current working directory |
listdir() | Returns a list of the names of the entries in a directory |
mkdir() | Creates a directory |
makedirs() | Creates a directory recursively |
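As a short sketch of how the directory functions behave, the example below creates a few directories inside a temporary directory (so nothing is left on disk) and lists them. The directory names are arbitrary, and the recursive function is spelled `os.makedirs()`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # makedirs() creates 'results' and 'results/plots' in one call
    os.makedirs(os.path.join(tmp_dir, "results", "plots"))
    # mkdir() creates a single directory; its parent must already exist
    os.mkdir(os.path.join(tmp_dir, "data"))
    # listdir() returns names in arbitrary order, so they are sorted here
    entries = sorted(os.listdir(tmp_dir))

print(entries)
```

`os.path.join()` builds the paths with the separator of the current operating system, which avoids hard-coding `/` or `\`.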
A regular expression is a sequence of characters that forms a search pattern.
Python has a built-in package called `re`, to work with regular expressions.
import re
x = re.findall("hello", "hello world, hello you!")
print(x)
['hello', 'hello']
Here are some useful functions from the `re` package.
Function | Description |
---|---|
findall() | Returns a list containing all matches |
search() | Returns a Match object if there is a match anywhere in the string |
split() | Returns a list where the string has been split at each match |
sub() | Replaces one or many matches with a string |
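The other functions of the table can be tried on the same sentence. This is only a small illustration with plain-text patterns:

```python
import re

text = "hello world, hello you!"

# search() returns a Match object, or None if there is no match
match = re.search("world", text)
# split() cuts the string at each occurrence of ", "
parts = re.split(", ", text)
# sub() replaces every occurrence of "hello" with "bonjour"
replaced = re.sub("hello", "bonjour", text)

print(match.group())
print(parts)
print(replaced)
```

Because `search()` returns `None` when nothing matches, it is safer to test the result before calling a method such as `.group()` on it.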
To be more specific about a sequence search, regular expressions use metacharacters (i.e. characters with a special meaning).
Metacharacter | Description | Example |
---|---|---|
[] | A set of characters | [a-m] |
\ | Signals a special sequence (can also be used to escape special characters) | \n |
. | Any character (except newline character) | he..o |
^ | Starts with | ^hello |
$ | Ends with | hello$ |
* | Zero or more occurrences | he.*o |
+ | One or more occurrences | he.+o |
? | Zero or one occurrences | he.?o |
{} | Exactly the specified number of occurrences | he.{2}o |
\| | Either or | hello\|bonjour |
() | Captures and groups | hello (.+) \1, in which \1 corresponds to what was captured by (.+) |
To build and test a regex, you can use regex101.com, or any website equivalent, in which you can write your regex, and some string to test, to see how it matches.
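You can also check patterns directly in Python. The sketch below tries a few of the metacharacters from the table on a made-up list of words; the helper `matching()` is hypothetical and only exists for this example.

```python
import re

words = ["hello", "helo", "heo", "hexxo", "say hello"]

def matching(pattern):
    """Returns the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

for pattern in ["^hello", "he.?o", "he.{2}o", "^he.+o$"]:
    print(pattern, matching(pattern))
```

Note that `re.search()` looks for the pattern anywhere in the string, so `he.{2}o` also matches inside `say hello`, whereas patterns anchored with `^` do not.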
A Match Object is an object containing information about the search and the result.
x = re.search("hello .*",
"""
hello world
hello you
bonjour
""")
print(x)
<re.Match object; span=(1, 12), match='hello world'>
The Match object has methods used to retrieve information about the search, and the result:
- `.span()` returns a tuple containing the start and end positions of the match.
- `.group()` returns the part of the string where there was a match.

print(x.group())
hello world
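When the pattern contains parentheses, `.group()` can also return what each group captured. A minimal sketch, using a made-up sentence:

```python
import re

x = re.search("hello (.+)", "well, hello world")

span = x.span()       # start and end positions of the whole match
whole = x.group()     # the whole match
captured = x.group(1) # what the first pair of parentheses captured

print(span)
print(whole)
print(captured)
```

The end position returned by `.span()` is exclusive, as in slicing: `"well, hello world"[6:17]` gives back the matched text.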
From the list `dna_sequences = ["ATGCGAATTCAC", "ATGAC", "ATGCCCGGGTAA", "ATGACGTACGTC", "ATGAGGGGTTCA"]`:
- Extract all sequences that start with `ATG` and end with `AC` or `AA`.
- Extract all sequences that contain either `G` or `C` repeated three times consecutively.

You should get the following results:
Sequences starting with 'ATG' and ending with 'AC' or 'AA':
['ATGCGAATTCAC', 'ATGAC', 'ATGCCCGGGTAA']
Sequences containing 'G' or 'C' repeated three times consecutively:
['ATGCCCGGGTAA', 'ATGAGGGGTTCA']
Create a program that you can run on the command line as follows: `./analyse_fasta.py path/to/fasta/file path/to/output/file`. It should:
Seq A C T G
seq1 0.1 0.2 0.3 0.4
seq2 0.4 0.3 0.2 0.1
...
To make this easier, consider that the sequences in the fasta file are only in one line.
You might make good use of the method `str.strip()`.
You can take as input the file in `exercise/data/example.fasta`; you should get the same result as `exercise/data/example.txt`.
A Python package contains a set of functions to perform specific tasks.
A package needs to be installed on your computer only once.
You can install a package with `pip`. It should have been installed automatically with Python; to make sure that you have it, you can run:
# In Linux/MacOS
python -m pip --version
# In Windows
py -m pip --version
If it does not work, check out pip documentation.
To install a package called pandas, you must run:
# In Linux/MacOS
python -m pip install pandas
# In Windows
py -m pip install pandas
To get more information about `pip`, check out the full documentation.
Installing a package is done outside of the Python interpreter, on the command line in a terminal.
When you wish to use a package in a Python script, you’ll need to import it by writing inside your script:
import pandas
Pandas is a package used to work with data sets, in order to easily clean, manipulate, explore and analyze data.
Pandas provides two types of classes for handling data:
- `Series`: a one-dimensional labeled array holding data of any type such as integers or strings. It is like a column in a table.

# If nothing else is specified, the values are labeled with their index number (starting from `0`).
myseries = pandas.Series([1, 7, 2], index = ["x", "y", "z"])
print(myseries)
x 1
y 7
z 2
dtype: int64
- `DataFrame`: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns. It is like a table.

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pandas.DataFrame(data)

print(df)
calories duration
0 420 50
1 380 40
2 390 45
You can also create a DataFrame from a file.
# Make sure this is the correct path for you! You are in the directory from where you execute the script.
df = pandas.read_csv('exercise/data/sample.csv')

print(df)
You get access to the index and column names with:
df.columns
df.index
You can rename index and column names:
df = df.rename(index={0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5 : 'f'})

df.index
You can select rows:
# Select one row by its label
print(df.loc[['a']])
# Select one row by its index
print(df.iloc[[0]])
# Select several rows by labels
print(df.loc[['a','c']])
# Select several rows by index
print(df.iloc[[0,2]])
You can select columns:
# Select one column by label
df['Tissue'] # Series
df[['Tissue']] # DataFrame

# Select several columns
df[['Gene','Expression_Level']]

# Select several columns by index
df.iloc[:,[0,1]]
You can select rows and columns as follows:
df.loc[['b'], ['Gene','Expression_Level']]
You can filter based on a condition as follows:
df[df['Expression_Level'] > 6]
To explore the data set, use the following methods:
df.info()
df.describe()
df.head()
df.sort_values(by="Gene")
df['Expression_Level'].mean()

df.groupby("Gene")[['Expression_Level']].mean()
To get more information on how to use pandas, check out:
Create a pandas DataFrame from the file containing the frequency of each nucleotide per sequence (`exercise/data/example.txt`).
Make sure that `df.index` contains the names of the sequences, and `df.columns` contains the nucleotides.
Use `pandas.melt()` (see the doc) to get the data in the following format:
nucl freq
Seq
seq1 A 0.46
seq2 A 0.20
seq3 A 0.16
seq4 A 0.18
seq5 A 0.26
seq1 T 0.22
seq2 T 0.12
...
Get the mean value of all nucleotide frequencies.
Get the mean value of frequencies per nucleotide.
Filter to remove values of seq1.
Recompute the mean value of frequencies per nucleotide.
Matplotlib is a package to create visualizations in Python widely used in science.
To shorten the name of the package when we call its functions, we can import it as follows:
import matplotlib.pyplot as plt
df = pandas.read_csv('exercise/data/sample.csv')

# The data for GeneA and GeneB is extracted from the DataFrame 'df'
serieA = df[df['Gene'] == 'GeneA']['Expression_Level']
serieB = df[df['Gene'] == 'GeneB']['Expression_Level']

# Create a new figure
fig = plt.figure()

# Create a boxplot showing the expression levels of GeneA and GeneB
# List of series
plt.boxplot([serieA, serieB], labels=['GeneA', 'GeneB'])

# Set the label for the x-axis
plt.xlabel('Gene')
# Set the label for the y-axis
plt.ylabel('Expression Level')
# Set the title of the plot
plt.title('Expression of Genes in Different Tissues')
# Display the boxplot
plt.show()
# Save the plot as a PNG file with a resolution of 300 dots per inch (dpi)
# The file will be saved in the specified location
fig.savefig('exercise/data/my-figure.png', dpi=300)
The following code is equivalent.
# Create a new figure
fig, ax = plt.subplots(1, figsize=(5, 4))

# List of series
ax.boxplot([serieA, serieB], labels=['GeneA', 'GeneB'])
ax.set_xlabel('Gene')
ax.set_ylabel('Expression Level')
ax.set_title('Expression of Genes in Different Tissues')

ax.legend()
plt.show()
# Save the plot as a PNG file with a resolution of 300 dots per inch (dpi)
# The file will be saved in the specified location
fig.savefig('exercise/data/my-figure-2.png', dpi=300)
The first way of plotting is function-oriented, and the second is object-oriented. You might encounter both styles of coding.
Many visualizations are available (static, animated, interactive). For more information, check out:
Create a script that gets nucleotide frequency data from a file in the format of `exercise/data/example.txt`, and visualizes it using Matplotlib and Pandas.
Your script should read the data, create a stacked bar chart showing the nucleotide frequencies for each sequence, and label the axes appropriately. Here’s the expected plot:
There are MANY packages available, here’s a short list of some that might interest you:
Package | Usage | Example of usage |
---|---|---|
BioPython | Computational molecular biology | Sequence handling, access to NCBI databases |
NumPy | Numerical arrays | Data manipulation, mathematical operations, linear algebra |
Seaborn | High-level interface for drawing plots | Data visualization, statistical graphics |
HTSeq | High throughput sequencing | Quality and coverage, counting reads, read alignment |
Scanpy | Single-Cell Analysis | Preprocessing, visualization, clustering |
SciPy | Mathematical algorithms | Clustering, ODE, Fourier Transforms |
Scikit-image | Image processing | Image enhancement, segmentation, feature extraction |
Scikit-learn | Machine learning | Classification, regression, clustering, dimensionality reduction |
TensorFlow and PyTorch | Deep learning | Neural networks, natural language processing, computer vision |
Here are a couple of tips:
You can follow some free tutorials on:
Finally, you should be able to use GitHub Copilot (AI coding assistant), as it is free for students: https://education.github.com/benefits.
A Python conference organized by the AFPy (Association Francophone Python) is held in Strasbourg at the end of October 2024!
Here are some references and resources that inspired this class: