Writing Simple Unix Scripts

Phil Jeffrey, Mar 2007, v0.1

Preamble

Just a little page on how to write simple Unix scripts, not intended to be an elaborate guide. This article has also been translated into Serbo-Croatian by WHG Team.

C Shells and Bourne Again Shells

A shell is just the interpreter that adds a more human interface to the Unix operating system. Historically the C shell (csh) and Bourne shell (sh) are the commonest ones. Tcsh (TENEX C shell) is an enhanced version of csh and is found widely - I use this as my default shell. Bash (Bourne Again Shell) is an enhanced version of sh and is also found quite widely. There are also several other shells like ksh, zsh etc - see the Wikipedia entry for Unix shells.

Which shell one should use is more a matter of dogma than anything else. Csh and tcsh are probably the more common ones in crystallography because the enhancements in tcsh were more widely available than bash at some time in the past, when we were all still working on SGIs (one of the machines that led the shift away from VAX/VMS as the usual operating system for crystallography). Program suites often come with two different startup or configuration files compatible with the different (sh, csh) shell syntaxes; however, when there is only one config file version it is more likely to be written in C shell than in Bourne shell. There's a difference between using shells for command line execution and using them to write elaborate scripts. Perl and Tcl (and perhaps even Python) are better suited to such scripts. Sh/Bash zealots like to point out C shell limitations in the widely disseminated page C shell programming considered harmful, but frankly there are better scripting languages to do most of the more advanced system administration functions in Unix.

Consequently, learn Perl or Tcl (or even C) if you want to do cute things with scripts, and Keep It Simple in tcsh or bash.

You can also RTFM tcsh. There are any number of shell guides and introductions if you just Google for them.

Hash Pling Slash (#!/)

The first line of your command file should look like:
#!/bin/csh -f
When you execute a script (e.g. with ./hello_world.csh) then the interpreter (i.e. your command shell) reads this line and understands it to mean:
execute the script as input to the program listed after the #! characters
so you can write Perl scripts, for example, with #!/usr/bin/perl in the header of the script file. In the case shown above the "-f" flag tells csh not to execute your .cshrc when it starts up - this speeds script execution and hopefully it inherits the relevant shell environment when it starts from the command shell. Remove the -f if there seem to be problems with picking up program setups - in particular aliases do not always seem to get inherited.

A very simple script would look like:

#!/bin/csh -f
#
# this is a comment
#
echo "hello world"
which illustrates the first (interpreter) line, the comment lines beginning with "#" and a small piece of code that does something - using the command "echo" to print "hello world" on the screen.

Simple Csh Syntax

The "C Shell Field Guide" by Anderson & Anderson is the definitive reference book, but to summarise simple syntax:

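If you're writing shell scripts I assume you already know about redirection, but to reiterate:

command > file      write standard output to file
command >> file     append standard output to file
command >& file     write standard output and standard error to file
command >>& file    append standard output and standard error to file
command < file      read standard input from file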

The & forms are useful only when you're trying to specifically trap error messages. Do not confuse this usage of & with the "run command in background" method of appending & to the end of the command line. If you have "noclobber" set in tcsh the redirection commands may refuse to overwrite existing files or append to files not already in existence. You have to use ">!" and ">>!" to override this behavior - consult the tcsh manual.
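
A short sketch of these redirections as they might appear in a script. The file names (listing.txt, errors.log) and the deliberately missing directory are made up for illustration; these particular lines are also accepted by bash.

```shell
# plain > creates (or overwrites) the file, >> appends to it
echo "first line"  > listing.txt
echo "second line" >> listing.txt

# < feeds the file to a program as its standard input;
# wc -l reports the number of lines, here 2
wc -l < listing.txt

# ls complains about the missing directory on standard error;
# >& sends that message, along with any normal output, to the file
ls /no/such/directory >& errors.log
```

Afterwards errors.log holds the complaint from ls, which a plain > would have left on the screen.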

The csh/tcsh feature that we are most concerned with is how to get my data into my program. Specifically you want to get the shell to shove a series of lines into the program being executed rather than interpret them in shell syntax. One very tedious way to achieve this is to do:

echo "first line of program input" > instructions.dat
echo "second line of program input" >> instructions.dat
echo "third line of program input" >> instructions.dat
program_name < instructions.dat
i.e. writing the input syntax to a file via the standard shell syntax for redirection > and >> and then getting it to read from that file via <. There are many situations in which the "program < instructions" paradigm is used, but in the shell there is another way to write the example above via a more compact version:
program_name << EOF-prog
first line of program input
second line of program input
third line of program input
EOF-prog
which says "take everything between the two instances of EOF-prog and treat it as input to the program". Note that the last EOF-prog is not seen by the program - the input is terminated at the third line. EOF-prog is not some sort of special string, it's just any old word. If the word is unquoted, the shell performs variable substitution on the program input lines (this is actually useful in shell programming); if you put the word in single or double quotes, that substitution is disabled. So if your data lines contain "$" and the shell keeps throwing "variable not set" errors, use single quotes: <<'EOF' rather than the plain <<EOF. The quote from the manual is:
Reads the C Shell input up to a line that is identical to word. word is not 
subjected to variable, file name or command substitution, and each input 
line is compared to word before any substitutions are done on the input line. 
Unless a quoting \, ", ', or ` appears in word, variable and command 
substitution is performed on the intervening lines, allowing \ to quote $, 
\ and `. Commands that are substituted have all blanks, tabs, and newlines 
preserved, except for the final newline which is dropped. The resultant text 
is placed in an anonymous temporary file that is given to the command as its 
standard input.
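
The effect of quoting the word can be seen in a small sketch that uses cat as a stand-in for a real program. HOME is chosen only because it is an environment variable that is almost always set; the terminating words and the output file names are arbitrary.

```shell
# unquoted word: $HOME is substituted before cat sees the line
cat << EOF-plain > substituted.txt
home directory: $HOME
EOF-plain

# quoted word: the line reaches cat exactly as typed
cat << 'EOF-quoted' > literal.txt
home directory: $HOME
EOF-quoted

# show both results one after the other
cat substituted.txt literal.txt
```

The first file ends up containing your actual home directory, the second the five characters $HOME.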

I use this construction all the time in CCP4 scripts:

#!/bin/csh -f
# 
# run SHELXC
#
/usr/local/shelx/macosx/shelxc se1 << EOF
HREM se1rm.sca
PEAK se1pk.sca
INFL se1in.sca
LREM se1lo.sca
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#
I'm lazy and always use the string "EOF" for this method. However you will reduce the potential for mayhem if you make each "EOF" have a distinct name and make sure they appear as pairs. I think any simple string is valid, not just the ones based on "EOF" but EOF is an acronym for End Of File so it has some archaic relevance to what we are doing. Less so if you're not a programmer. In any event make it something distinctive, probably partially upper case, so as not to make it look like data or the name of a program.

Running the Script

First create the script as a simple text file without any additional weird characters (i.e. do it via Emacs or Vi or cat, not via Microsoft Word). For ease of detection I tend to make my Cshell script files end in .csh or .com, but the presence of a "file extension" and its name are entirely optional in Unix. Then you need to make the script file executable:
chmod u+x my_script.csh
will make it executable only by you, or:
chmod a+x my_script.csh
will make it executable by all (everyone) if you are feeling generous. (A plain "chmod +x" applies to everyone your umask allows, which usually means all users.)

If you just type "my_script.csh" the shell may or may not find the script. This is because any command that you type in the shell that isn't an absolute or relative path (e.g. /bin/ls, ./myprog, ../myotherprog) is first looked for as a shell built-in (echo is one such command), then the path variable is searched from left to right for a location containing the command "my_script.csh". Do an "echo $path" to see the contents of your path - it's the list of directories that are searched for programs. If your path does not contain "." then the shell will not find such a command in the current directory, and it may find a completely different file called my_script.csh elsewhere in the path and execute that instead! So while typing "ls" is likely to execute the program /bin/ls, it is not guaranteed to do so. For reasons of security and sanity it is best to give the path explicitly, either as an absolute path:

/Users/phil/Structures/examples/my_script.csh
or the relative path such as
./my_script.csh
For security reasons it's not always a good idea to have "." in your path, but if you do, at least make sure it's always the last component of $path. You can add it using a line:
set path = ($path .)
as the last line in your ~/.cshrc - knowing what you are doing before messing with your .cshrc is always an excellent idea.
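
Putting the steps of this section together, as a sketch: the commands below write out the small hello world script shown earlier (using the << construction so the example is self-contained), make it executable and run it by its relative path. It assumes csh is installed as /bin/csh.

```shell
# write the script to a file; the quoted word stops the shell
# from substituting anything inside it
cat > hello_world.csh << 'EOF-script'
#!/bin/csh -f
echo "hello world"
EOF-script

# make it executable (owner only), then run it by its relative path
chmod u+x hello_world.csh
./hello_world.csh
```

Because the script is started as ./hello_world.csh, it runs whether or not "." is in your path.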

Getting Into Trouble With More Advanced Shell Syntax

Various examples:

You could create a simple disk space monitoring script:

#!/bin/csh -f
#
#
while (1)
sleep 60
df -kl
end
which introduces the syntax for while....end and also the sleep command - this one just sits there and runs "df -kl" every minute until you kill it.

You can simplify laborious tasks for doing things like calculating the Mean Fractional Isomorphous Difference (MFID) between all possible pairs of MTZ (.mtz) files containing single datasets:

#!/bin/csh -f
#
# start with a fresh log file
echo "" > mfid.log
# loop over every ordered pair of MTZ files in this directory
foreach file1 (*.mtz)
 foreach file2 (*.mtz)
  echo "Using $file1 and $file2" >> mfid.log
  ./mfid.csh $file1 $file2 >> mfid.log
 end
end
OK, so there's a lot going on in this short script. First, the syntax (*.mtz) is a list of filenames matching the pattern specified (i.e. all files that end in .mtz). The syntax foreach name (list) ..... end cycles through this list one filename at a time and assigns each value in turn to the variable whose name you specified. I used two nested "foreach" loops in the above example. Then the mfid.csh script takes the two file names as arguments and the output of that script gets appended to the log file "mfid.log". The first echo command creates a blank log file and the second writes the two filenames to it (but mfid.csh could also write those names to the output, as an alternative).

Now, how do you get mfid.csh to accept the filenames as arguments? Well, the shell allows this via special variables $0, $1, $2 etc:

#!/bin/csh -f
# remove any output left over from a previous run (-f: no complaint if absent)
\rm -f mfid_merged.mtz
#
cad HKLIN1 $1 HKLIN2 $2 HKLOUT  mfid_merged.mtz << eof-cad

RESOLUTION OVERALL 50.0 6.
SYMMETRY P6122
TITLE  merge two data files together

LABIN  FILE 1 E1=F   E2=SIGF
CTYPE  FILE 1 E1=F   E2=Q  
LABOUT FILE 1 E1=FN1 E2=SIGFN1

LABIN  FILE 2 E1=F   E2=SIGF 
CTYPE  FILE 2 E1=F   E2=Q 
LABOUT FILE 2 E1=FN2 E2=SIGFN2

END
eof-cad
so here the value of the first argument ($1) and the second argument ($2) are used as filenames assigned to HKLIN1 and HKLIN2 by the program "cad". Scaleit, the program that actually calculates the MFID, is not shown but is also part of this script. $0 is the name of the script itself, and it can be useful to print that for debugging purposes.
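
As a sketch of how these variables behave, the commands below create a throwaway script (show_args.csh, a hypothetical name) that checks how many arguments it was given and then prints them. $#argv, the number of arguments, is standard csh but is not mentioned above; the file names on the last line are placeholders and do not need to exist. /bin/csh is assumed to be present.

```shell
cat > show_args.csh << 'EOF-script'
#!/bin/csh -f
# refuse to run unless exactly two arguments were given
if ($#argv != 2) then
    echo "usage: $0 file1.mtz file2.mtz"
    exit 1
endif
echo "script name:     $0"
echo "first argument:  $1"
echo "second argument: $2"
EOF-script

chmod u+x show_args.csh
./show_args.csh native.mtz derivative.mtz
```

Run with only one file name, the script prints the usage line and stops, which is usually kinder than letting a program like cad fail on an empty filename.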

More examples could go here if I felt that they would do more good than harm.

Tests for file name existence:

if (-e $1) echo "File $1 exists"
if (! -e $1) echo "File $1 does not exist"
More tests like that could use the test command - see "man test". The ones for tcsh include: -e (exists); -o (owned by user); -r (readable); -w (writable); -x (executable); -z (zero size); -d (is directory). All these tests fail if the file does not exist.
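
A sketch of the same kind of check done with the external test command instead of the built-in operators. The file and directory names are invented, and the option letters of test are not all the same as the tcsh ones (test uses -z for an empty string, not an empty file), so check "man test" before borrowing one.

```shell
# create something to examine
echo "some data" > example.dat
mkdir -p example_dir

# test prints nothing and only sets its exit status;
# && runs the echo when the test succeeds, || when it fails
test -e example.dat  && echo "example.dat exists"
test -d example_dir  && echo "example_dir is a directory"
test -r example.dat  && echo "example.dat is readable"
test -e no_such_file || echo "no_such_file does not exist"
```

All four messages are printed, the last one because the test on the missing file fails.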

Tests for numeric values:

if ($a > $b) echo "A ($a) is more than B ($b)"
if ($a == 9) echo "A is equal to 9"
if ($a <= 9) echo "A is less than or equal to 9"

String comparisons:

if ("$q" == "yes") echo "Answer is yes"

If you get to here you are way beyond the point where you should have read the C Shell Field Guide.

Mathematics

Mathematical calculations in Cshell are a nightmare with unwieldy syntax. Just use some other script language if you want to do such calculations. Cshell cannot do floating point calculations. Read more here if you are some sort of masochist determined to use Cshell for simple math. If you're going to do that sort of thing Perl, Tcl or Python are almost certainly better options.
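
For the masochists, a sketch of the simplest cases. The built-in @ command does integer arithmetic only; handing the floating point part to awk is one common workaround and is my assumption here, not something this page prescribes. The script name sums.csh is made up, and /bin/csh is assumed to be present.

```shell
cat > sums.csh << 'EOF-script'
#!/bin/csh -f
# @ does integer arithmetic only, so the remainder is thrown away
@ n = 7 / 2
echo "integer result: $n"
# hand anything with a decimal point to an external program
set x = `awk 'BEGIN { printf "%.2f", 7 / 2 }'`
echo "floating point result: $x"
EOF-script

chmod u+x sums.csh
./sums.csh
```

This prints 3 for the integer division and 3.50 for the awk version. Note that @ insists on the spaces around the operators.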