Course Homepage
Sept 2018
This project is maintained by BIG-SA
2018-04-04
Jimmy Breen (jimmymbreen@gmail.com) Stephen Pederson Paul Wang John Toubia
This session we will be tying together many of the concepts that you have learnt over the last few weeks. Today we’ll be learning about:
bash scripts
for loops
Much like we have begun all other weeks, if you didn’t create a folder for last week’s files, let’s create one and put all of today’s work in there. First navigate to your home folder, which may be one of /home/
cd ~
From here let’s create a new folder for today:
mkdir -p BashWk4/files
cd BashWk4/files
pwd
To run the example scripts in this tutorial, we will need some example data. The Australian government provides a large amount of open data on the website data.gov.au, and to create our scripts we will use a dataset containing particle pollution data for the Adelaide CBD region from 2015 onwards.
Side note: Public datasets are becoming increasingly common in today’s society, and many companies and media outlets use this data to investigate everything from the misuse of public funds to environmental monitoring. A brilliant example of “data journalism” is the US website fivethirtyeight.
To download this data, we will need to go to our “files” directory, run the wget command to get a zip file containing all our csv files, and unpack the zip file:
wget -c https://data.sa.gov.au/data/dataset/9fd65c8d-a3bc-474e-9cf2-03a58a837fc0/resource/a0fa35fb-fedf-4db6-8bbb-668f9959fe42/download/adl07p.zip
unzip adl07p.zip
Sometimes we need to perform repetitive tasks on multiple files, or a complex series of tasks, and writing the set of instructions as a script is a very powerful way of doing this. Scripts are also an excellent way of ensuring the commands you have used in your research are retained for future reference. Keeping copies of all electronic processes to ensure reproducibility is a very important component of any research.
Many people can get freaked out by the complexity of a program “script”, but it’s really quite simple. Think of a script as a to-do list that you are giving the computer to run a task. The script/code is sent to an interpreter (in our case, the program bash), which then converts the to-do list into a language that the computer can understand and execute.
Programmers will often start a project with a basic outline of the tasks that need to be done. Often we call it “pseudo-code”.
Have a look at our Adelaide CBD dataset by using the command less on the first month file of 2015 (“./ADL07p/ADL07p_1hr201501.csv”). There’s a lot of data in there, but we want to simplify it by only grabbing the 1st column (“Date/Time”) and the 4th and 5th columns (“Temperature Deg C” and “Barometric Pressure atm”). To do this I would write the following pseudo-code:
Step 1: Change into our data directory
Step 2: Open input file
Step 3: Cut the file into columns
Step 4: Print 1st, 4th and 5th column
Step 5: Output the result into a new file
This simple example can be written as a bash script below (Don’t worry that you don’t understand all of it yet. That will come later):
#!/bin/bash
# If you haven't already, change into the "files" directory inside "BashWk4"
# The following line will need to be modified accordingly if you saved the files somewhere other than ~/BashWk4/files
cd ~/BashWk4/files
# Store the path of the input file (relative to the current directory) in a variable
INPUT="ADL07p/ADL07p_1hr201501.csv"
# Use unix cut to divide the file into columns via the ","
# - keep only columns 1, 4 and 5
# - write the result to a new file
cut -d',' -f1,4,5 "${INPUT}" > "${INPUT}.new.csv"
Using your text editor of choice, save this file as “basic_example_1.sh”. To then run this script on the command-line, we would run:
$ bash basic_example_1.sh
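To see exactly what the cut command in the script keeps, it can help to try it on a tiny file first. The sketch below is only an illustration: the column names and values in the sample are invented and are not taken from the Adelaide CBD data.

```shell
#!/bin/bash
# Build a tiny made-up CSV (these values are invented for illustration only)
SAMPLE=$(mktemp)
printf '%s\n' \
  'Date/Time,A,B,Temperature,Pressure' \
  '1/01/2015 01:00,10,5,21.3,1.002' \
  '1/01/2015 02:00,12,6,20.8,1.001' > "${SAMPLE}"

# Keep only fields 1, 4 and 5, using the comma as the delimiter
RESULT=$(cut -d',' -f1,4,5 "${SAMPLE}")
echo "${RESULT}"

# Tidy up the temporary file
rm -f "${SAMPLE}"
```

Columns 2 and 3 ("A" and "B" here) are dropped, and every other field is printed unchanged, still separated by commas.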
Questions
# symbol: Shebang and Comments
The example above displayed a very basic example of a bash shell script.
Every bash shell script begins with what is known as a “shebang”, which we would commonly recognise as a hash sign followed by an exclamation mark, i.e. #!. This is immediately followed by /bin/bash, which tells the operating system to run the script using the program bash, found in the directory /bin. This opening sequence is vital and tells the computer how to interpret all of the following commands. As a string this looks like:
#!/bin/bash
The hash symbol generally functions as a comment character in scripts (as shown above in “Basic example 1”). Sometimes we can include lines in a script to remind ourselves what we’re trying to do, and we can preface these with the hash to ensure the interpreter doesn’t try to run them. Its presence at the very start of the file, followed by the exclamation mark, is specifically looked for when the script is executed, but beyond this specific occurrence, comment lines are ignored by the interpreter.
Comments are very important in programming because they act as notes or explanations so you can understand what you were thinking when you wrote the code. If you look at your code 6 months from now, there is a very strong chance that you won’t recall exactly what you were thinking at the time, so these comments can be a good place just to explain something to the future version of yourself. There is a school of thought which says that you write code primarily for humans to read, not for the computer to understand.
An important concept in UNIX computing is file permissions. There are files which you look at and interact with when you are running analysis on a UNIX command line, and there are files which are so important to the system that we shouldn’t be able to touch them. And of course, if you are creating and interacting with files in your home directory, you may want to make those files available only to you, and not to other users.
A file can have three types of attributes:
To demonstrate what this means, I’m going to run a simple ls command on the data directory that we downloaded and unpacked for this tutorial:
ls -l ./ADL07p
The result on my computer is:
total 2192
-rw-r--r-- 1 jbreen staff 26826 23 Jun 2016 ADL07p_1hr201501.csv
-rw-r--r-- 1 jbreen staff 24288 23 Jun 2016 ADL07p_1hr201502.csv
-rw-r--r-- 1 jbreen staff 26977 23 Jun 2016 ADL07p_1hr201503.csv
-rw-r--r-- 1 jbreen staff 25593 23 Jun 2016 ADL07p_1hr201504.csv
...
For each file there are a number of blocks of information. For now, we don’t need to know what everything means, but you might be able to guess at some of it. The third and fourth blocks of information (i.e. jbreen and staff) are the owner of the file and the group it belongs to. After that there is the size of the file (in bytes), the date that the file was last edited and then the name of the file.
The first block contains the important information about file access permissions. These permissions follow the format of read(r), write(w) and execute(x). The block contains the following information:
Directory | Owner | Group | Other |
---|---|---|---|
- or “d” | rwx | rwx | rwx |
So in my directory, ADL07p_1hr201501.csv has the permissions -rw-r--r--, which means that it’s a file (not a directory), where jbreen (the owner) has read and write access but can’t execute it (rw-), members of the group (staff) have read access only (r--) and everyone else has read access only (r--).
There are two main ways of executing a script. Firstly, as shown in “Basic Example 1”, we can just declare the interpreter of the language on the command-line, followed by the name of the script.
$ bash basic_example_1.sh
However, we shouldn’t need to name the interpreter on the command line, considering that it is already declared in line 1 of the script! To run the script by its name alone, the script needs to be executable, and we need to adjust the read/write/execute file permissions explained above. By adding execute permissions to the file, the script can be run as a program and not just a regular file.
First we’ll look at the files in the folder using ls -l and note the permission triplets for the script, which should be rw- for the user & possibly for the group you belong to, with no x. To make this script executable, enter the following in your terminal.
$ chmod +x ./basic_example_1.sh
If you run ls -l again, you’ll notice that the third flag in the triplets has now become an x. This indicates that we can now execute the file in the terminal. As a security measure, the current directory is not normally included in the list of places the shell searches for commands (the PATH), so to execute a script in the current directory we need to give its path explicitly by adding ./ to the front:
$ ./basic_example_1.sh
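The effect of chmod on the permission block can be watched directly. The following is a minimal sketch that works in a throwaway directory; the script name hello.sh is made up for the illustration, and it uses chmod u+x (execute for the owner only) rather than chmod +x so that the result does not depend on your system's default settings.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing in your home folder is touched
TMP_DIR=$(mktemp -d)
SCRIPT="${TMP_DIR}/hello.sh"   # hypothetical script name

# Write a two-line script
printf '%s\n' '#!/bin/bash' 'echo "Hello from hello.sh"' > "${SCRIPT}"

# Start from read/write for the owner and read-only for everyone else
chmod 644 "${SCRIPT}"
BEFORE=$(ls -l "${SCRIPT}" | cut -c1-10)

# Add execute permission for the owner only
chmod u+x "${SCRIPT}"
AFTER=$(ls -l "${SCRIPT}" | cut -c1-10)

echo "Before: ${BEFORE}"
echo "After:  ${AFTER}"

# Tidy up
rm -r "${TMP_DIR}"
```

Only the first triplet (the owner's) gains an x in this sketch; the group and other triplets are left as they were.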
Let’s now look at another simple script.
If you look back at “Basic Example 1”, when we ran the command cut -d',' -f1,4,5 ${INPUT}, it made a file called ${INPUT}.new.csv. This looks a little bit messy because if your input file is a csv file, you end up producing an output file that has the extension .csv.new.csv. Additionally, if you run the script from a directory other than the one containing the file, the output file may not be saved where you expect.
To overcome these issues, we are able to use a few different UNIX commands to help orientate us within the file system.
basename ${INPUT}: Outputs the name of the file without the full path.
dirname ${INPUT}: Outputs the name of the directory that contains the file.
readlink -f ${INPUT}: Outputs the full name and full path of the file.
pwd: Prints the full path of the directory you are currently in.
The command basename is incredibly useful in bash scripting because it can also be used to strip a specific file extension off a file name. For example, let’s re-write “Basic Example 1” using basename to remove the file extension and save the file without the hideous .csv.new.csv.
#!/bin/bash
# If you haven't already, change into the "files" directory inside "BashWk4"
#cd ./files
# Store the path of the input file (relative to the current directory) in a variable
INPUT="ADL07p/ADL07p_1hr201501.csv"
# Get the directory name, and the filename without the path and extension
DIR=$(dirname "${INPUT}")
NAME=$(basename "${INPUT}" .csv)
BASE=$(pwd)
echo "I am currently in the directory: ${BASE}"
echo "The file directory name is ${DIR}"
echo "The file base name is ${NAME}"
# Use unix cut to divide the file into columns via the ","
# - keep only columns 1, 4 and 5
# - write the result to a new file
cut -d',' -f1,4,5 "${INPUT}" > "${BASE}/${DIR}/newfile.${NAME}.csv"
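Both basename and dirname work purely on the text of the path, so they can be tried out without the file having to exist. A small sketch of what each one returns for the path used in the script above (the variable name FILE is introduced here only for the illustration):

```shell
#!/bin/bash
INPUT="ADL07p/ADL07p_1hr201501.csv"

DIR=$(dirname "${INPUT}")            # directory part of the path
FILE=$(basename "${INPUT}")          # file name, extension kept
NAME=$(basename "${INPUT}" .csv)     # file name with the .csv suffix stripped

echo "dirname gives:            ${DIR}"
echo "basename gives:           ${FILE}"
echo "basename with .csv gives: ${NAME}"
```

The second argument to basename is only removed when the file name actually ends with it, so a file with a different extension would be left unchanged.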
As you’ve probably noticed above, a variable is essentially a holding place for information that the program needs to run its code. In “Basic example 1”, we stored the name of our input file in the variable INPUT and in “Basic example 2”, we will read text into the variables ME and MESSAGE. You’ll notice that when you declare the variable you use the equals sign, with no spaces around it, to assign the information to that variable name (e.g. VARIABLE_NAME="THIS IS THE INFORMATION"), while when you actually use the variable in your code, you put a $ in front to indicate that this is in fact a variable.
Additionally, notice the use of the curly brackets around the variable name in “Basic example 1” (e.g. ${INPUT}). Whilst not being strictly required, this can make it easy for you to follow in the future when you’re looking back. It’s also helpful to type variables using strictly upper-case letters. This is another optional coding style, but can also make things clear for you as you look back through your work. Most command line tools use strictly lower-case names, so this is another reason the upper-case variable names can be helpful.
TLDP.org defines a variable as:
"in bash (a variable) can contain a number, a character, a string of characters. You have no need to declare a variable, just assigning a value to its reference will create it."
This can be anything from the name of a file (as we’ve seen in “Basic Example 1”, where we stored the name of our file in the variable ${INPUT}), a number or string, or even the output of a unix command (more of those later).
Additionally, there are a number of special variables that can be used when writing bash scripts, and these have certain behaviours:
$0 - The name of the Bash script.
$1 -> $9 - The first 9 arguments to the Bash script.
$# - How many arguments were passed to the Bash script.
$@ - All the arguments supplied to the Bash script.
$? - The exit status of the most recently run process.
$$ - The process ID of the current script.
$USER - The username of the user running the script.
$HOSTNAME - The hostname of the machine the script is running on.
$SECONDS - The number of seconds since the script was started.
$RANDOM - Returns a different random number each time it is referred to.
$LINENO - Returns the current line number in the Bash script.
Using the text editor gedit, enter the code below into a file, setting your actual name as the ME variable, and save it as wellDone.sh in your home folder.
#!/bin/bash
# First we'll declare some variables with some text strings
ME='Put your name here'
MESSAGE='This is your first script'
# Now we'll place these variables into a command to get some output
echo -e "Hello ${ME}\n${MESSAGE}\nWell Done!"
Now change permissions and execute the script, and see what the output is.
Consider the ME variable in the script above. Let’s change the variable to read a special variable $1:
ME=$1
Now save this as wellDone2.sh. You’ll now need to set the execute permission again.
chmod +x wellDone2.sh
This time we have set the script to receive input from a command-line argument, and we will need to supply a value, which will then be placed in the variable ME. Choose whichever random name you want and enter the following
$ ./wellDone2.sh Boris
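The argument-related special variables can also be explored without saving a separate script for every test. The sketch below leans on the bash builtin set --, which replaces the current arguments with whatever follows it; the two names are made up, and FIRST and COUNT are variable names chosen just for this illustration.

```shell
#!/bin/bash
# Pretend the script was called with two arguments (made-up names)
set -- Boris Natasha

FIRST=$1     # the first argument
COUNT=$#     # how many arguments were supplied

echo "First argument:      ${FIRST}"
echo "Number of arguments: ${COUNT}"
echo "All arguments:       $@"
```

Running a real script as ./wellDone2.sh Boris Natasha would set $1, $# and $@ in the same way.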
So it’s clear that you can assign a number or a string to a variable (e.g. ‘Put your name here’ and ‘This is your first script’ were assigned to the variables ME and MESSAGE in Basic Example 2), but you can also capture the output of a unix command and assign it to a variable.
Type the following script into a text file and save it as “count_lines.sh”:
#!/bin/bash
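# Store the name of the input file (the first command-line argument) in INPUT
INPUT=$1
# Count the number of lines using wc and store the result in NUM_LINES
# - redirecting with "<" stops wc from printing the file name after the count
# - the name LINES is avoided because bash uses it for the terminal height
NUM_LINES=$(wc -l < "${INPUT}")
# Print the variable to see the result
echo "The number of hours contained in the data is ${NUM_LINES}"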
Set the permissions and execute the file by declaring the name of the script and one of our Adelaide CBD csv files:
$ ./count_lines.sh ADL07p/ADL07p_1hr201704.csv
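One detail worth knowing when capturing the output of wc -l: if the file is given as an argument, wc prints the file name after the count, whereas if the file is fed in with the < redirect, only the number is printed. A minimal sketch on a throwaway three-line file (the contents are invented):

```shell
#!/bin/bash
# Create a temporary three-line file to count
SAMPLE=$(mktemp)
printf '%s\n' 'header' 'row 1' 'row 2' > "${SAMPLE}"

# File given as an argument: the count is followed by the file name
WITH_NAME=$(wc -l "${SAMPLE}")

# File given on standard input: only the count is printed
ONLY_NUMBER=$(wc -l < "${SAMPLE}")
# Arithmetic expansion strips the padding that some versions of wc add
ONLY_NUMBER=$(( ONLY_NUMBER ))

echo "As an argument:    ${WITH_NAME}"
echo "On standard input: ${ONLY_NUMBER}"

# Tidy up the temporary file
rm -f "${SAMPLE}"
```

Note also that a csv file with a header row has one more line than it has rows of data.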
The Adelaide CBD particulate data is measured by a Beta Attenuation Monitor (BAM), where the two individual measurements are PM10 BAM µg/m3 (Particulate Matter 10 microns or less) and PM2.5 BAM µg/m3 (Particulate Matter 2.5 microns or less). Across every month in 2015, during what time of the day were the PM10 and PM2.5 measurements at their highest?
Using the skills in this tutorial, as well as your knowledge of commands such as echo, awk and grep, write a script that reads the 2015 data and outputs the answer.
When we write scripts, we generally do it for a specific purpose, and therefore we are generally in control of the input data and the execution of the code that we have created. However, if we create a program for others to use, we often need to make sure the script is robust to any issue that the user-defined inputs might throw at it.
That’s where control statements are helpful. They are used in almost all programming languages and act as a traffic light or decision structure for tasks to proceed. The main control statements that you will use in your scripting are if or if/else statements, which have the following format:
if <condition>
then
### series of code goes here
fi
This statement only makes one decision. If the condition is confirmed, then it runs the code after the then command and finishes. But you can have multiple decisions:
if <condition>
then
### series of code if the condition is satisfied
else
### series of code if the condition is not satisfied
fi
or
if <condition1>
then
### series of code for condition1
elif <condition2>
then
### series of code for condition2
else
### series of code if the condition is not satisfied
fi
NOTE: Be sure to note the use of formatting on control statements. Generally, the then statement is one space indented from the if command, and the code that is inside the conditional is 1 tab inside. You do not need the tab or space for the control statement to work, but it’s considered good programming practice to separate your code so it’s clear to the person reading it. The same is done with for loops, which we will introduce later.
There are three types of conditions that you can use in an if statement: String, File and Arithmetic-based conditions. Today we will only cover a few, but if you want to know all of them, check out this page.
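As a rough sketch, here is one condition of each type written with the [ ] test syntax. The values of NAME and COUNT are made up, and the file being tested is a temporary file created by the sketch itself so that it is certain to exist.

```shell
#!/bin/bash
NAME="Boris"          # made-up string
COUNT=5               # made-up number
TARGET=$(mktemp)      # temporary file that certainly exists

# String condition: are the two strings identical?
if [ "${NAME}" = "Boris" ]
then
    STRING_RESULT="names match"
else
    STRING_RESULT="names differ"
fi

# Arithmetic conditions: -gt means "greater than"
if [ "${COUNT}" -gt 10 ]
then
    SIZE="large"
elif [ "${COUNT}" -gt 3 ]
then
    SIZE="medium"
else
    SIZE="small"
fi

# File condition: -e means "the file exists"
if [ -e "${TARGET}" ]
then
    FILE_RESULT="file exists"
else
    FILE_RESULT="file does not exist"
fi

echo "${STRING_RESULT}, ${SIZE}, ${FILE_RESULT}"

# Tidy up the temporary file
rm -f "${TARGET}"
```

The spaces just inside the square brackets are required, and quoting the variables protects the test from breaking if a value is empty or contains spaces.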
As mentioned above, if statements are good for file checking. For example:
#!/bin/bash
cd
ls
if [ -e sample.sh ]
then
    echo "file exists!"
else
    echo "file does not exist"
fi
The -e flag here means “return true if file exists”.
Here is a real-life example from a whole-genome bisulfite sequencing project that I was working on previously:
I’ve made a script that takes in 3 command-line arguments at the time of execution: a reference genome, the name of the directory where all my files are and the suffix of the file that I want to test. So the script can take 3 arguments, but what if the user accidentally puts in 2? Or 1? Well, I used a control statement to check whether the user has done the correct thing. If they haven’t, I want to print out a little help page that informs them what the correct usage of the script is.
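#!/bin/bash -l
# Store the three command-line arguments in named variables
suff=$3
dir=$2
ref=$1
# Input variable checks
if [ $# -lt 3 ]
then
    echo "Usage: $0 <ref> <directory containing bedGraph(gz)> <file suffix>"
    # A plain echo prints an empty line (without -e, "\n" would be printed literally)
    echo ""
    echo "Example: $0 hg19.fasta directory _CpG.bedGraph.gz"
    exit 1
fi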
The if statement here essentially means, “if there are fewer than 3 arguments ($# holds the number of arguments) at time of execution, then run the three echo commands and exit”.
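The same check can be tried out without creating a script file by wrapping it in a shell function, since inside a function $# counts the arguments passed to that function. Functions are not covered in this session, and check_args is a hypothetical name used only for this sketch; it uses return instead of exit so that the terminal session is not closed.

```shell
#!/bin/bash
# check_args is a hypothetical helper, used only to illustrate the $# test
check_args() {
    if [ $# -lt 3 ]
    then
        echo "Usage: check_args <ref> <directory> <file suffix>"
        return 1
    fi
    echo "Received: ref=$1 dir=$2 suffix=$3"
}

# Two arguments: the usage message is printed and the status is non-zero
check_args hg19.fasta directory
echo "Exit status with two arguments: $?"

# Three arguments: the check passes
check_args hg19.fasta directory _CpG.bedGraph.gz
echo "Exit status with three arguments: $?"
```

The non-zero status is what lets another script, or the special variable $?, detect that the arguments were wrong.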
Using your Adelaide CBD air data, write a script that counts the number of fields/columns in each file. Use a control statement to only count files that have no empty fields (i.e. columns with no data in them)
for loops
So far we’ve just touched on working with one file at a time. But what if you want to run the same command on multiple files?
For this we can use a for loop. A loop is a type of statement that enables the programmer to execute repetitive tasks in one small bit of code, instead of just repeatedly writing the command. The loop has three elements: a) the item, b) the iterable object or list, and c) the command to run on our item.
Take our to-do list analogy for example. Instead of writing:
From the grocery store, buy milk
From the grocery store, buy cheese
From the grocery store, buy coffee
From the grocery store, buy onions
From the grocery store, buy bread
From the grocery store, buy chocolate
From the grocery store, buy toilet paper
From the grocery store, buy tomatoes
From the grocery store, buy bacon
We could write something like:
for item in grocery_store
do
buy item
done
In this context, the word “item” ends up being the variable, and so we run the command “buy” on each “item” in the list (or iterable object) “grocery_store”. Each loop requires you to start with the command for, mark the start of the commands to run with do, and close the loop by adding done at the end. We generally write each part of this loop structure on its own line, with the command indented by one tab character. This is an established style throughout most programming languages, but the loop can easily be run on one line using semi-colons as separators:
for item in grocery_store; do buy item; done
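The grocery example is pseudo-code, so it will not run as written. A minimal runnable version in bash might look like the sketch below, where the list is typed out directly after in, echo stands in for the made-up "buy" command, and COUNT is an extra variable added only to show that the loop body runs once per item.

```shell
#!/bin/bash
COUNT=0
for ITEM in milk cheese coffee onions bread
do
    # ${ITEM} holds a different word from the list on each pass
    echo "From the grocery store, buy ${ITEM}"
    COUNT=$(( COUNT + 1 ))
done
echo "Number of items bought: ${COUNT}"
```

Notice that the variable is written without a $ after for, where it is being assigned, and with a $ inside the loop, where it is being used.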
Let’s write a script that shows you how a for loop works.
In the following example we are going to loop over a list of numbers and make five files.
#!/bin/bash
# Store the path of the current directory in the variable WORKING_DIR
WORKING_DIR=$(pwd)
# Create a new directory called results where our files will be saved
mkdir -p "${WORKING_DIR}/results"
# Loop: 1. Iterate over the list of numbers "1 2 3 4 5", storing each in NUM
#       2. Echo some text into a new file
#       3. Print the file using the command cat
for NUM in 1 2 3 4 5
do
    echo "The file has the number ${NUM} in it" > "${WORKING_DIR}/results/file_${NUM}.txt"
    cat "${WORKING_DIR}/results/file_${NUM}.txt"
done
Now save this script as basic_example_3.sh and run it:
bash basic_example_3.sh
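The list in a for loop does not have to be typed out by hand: a wildcard such as *.csv expands to every matching file name. The sketch below builds three small made-up csv files in a temporary directory and then loops over them; the file names and contents are invented for the illustration.

```shell
#!/bin/bash
# Make a throwaway directory holding three small made-up csv files
TMP_DIR=$(mktemp -d)
for MONTH in 01 02 03
do
    printf '%s\n' 'Date/Time,Value' '1/01/2015 01:00,3' > "${TMP_DIR}/sample_${MONTH}.csv"
done

# Loop over every csv file in that directory
N_FILES=0
for FILE in "${TMP_DIR}"/*.csv
do
    NAME=$(basename "${FILE}" .csv)    # file name without path or extension
    N=$(wc -l < "${FILE}")             # number of lines in this file
    echo "${NAME} has $(( N )) lines"
    N_FILES=$(( N_FILES + 1 ))
done
echo "Files processed: ${N_FILES}"

# Tidy up
rm -r "${TMP_DIR}"
```

Keeping the wildcard outside the quotes lets the shell expand it, while the quotes around ${TMP_DIR} and ${FILE} keep paths containing spaces in one piece.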
Write a script that uses a for loop to rename each file in the “ADL07p” directory. Rename the file so it has the word “newfile_” at the front of the file (e.g. ADL07p_1hr201710.csv -> newfile_ADL07p_1hr201710.csv)
Create a script that does the following: