code.ex(python): 2011

Wednesday, October 5, 2011

cron Hell

Finally solved a nightmare: getting a Python script to execute properly (or at all!) under crontab on a 64-bit Linux machine.

Things I learned:

1. cron executes commands using /bin/sh, so if you want to run a bash script you need to call bash explicitly in the crontab, like this:

*/1 * * * * root /bin/bash /opt/bin/hcv_update.cron

2. cron does not load your user's environment variables, so you need to figure out how to get it to see those for the user you want to run as (root). The easiest way to do this is to make a shell script that calls your Python script, which gives you more control over what is going on. You can also source your .bash_profile in that script so that the variables are inherited.

#!/bin/bash

source /root/.bash_profile

# calls python update script

/usr/local/bin/python2.7 /opt/bin/hcv_comm_query.py

3. If you have more than one version of python on the system (RedHat always has 2.4 present), make sure you explicitly call the right one.

4. In this case I was also using cx_Oracle, so I explicitly defined the Oracle paths in the Python script just in case. This shouldn't be necessary if they are already set in .bash_profile.

import os

os.environ["LD_LIBRARY_PATH"]="/usr/lib/oracle/11.2/client64/lib"

os.environ["ORACLE_HOME"]="/usr/lib/oracle/11.2/client64"

os.environ["TNS_ADMIN"]="/usr/lib/oracle/11.2/client64"

import cx_Oracle as cxo

This should solve the problem.
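
When a cron job fails silently, it can help to have the script check its own environment before importing cx_Oracle. This is only a minimal sketch and was not part of my script: the helper name missing_env_vars is made up for illustration, and the variable names are the three set above.

```python
import os

# Variables the Oracle client needs, taken from the list above.
REQUIRED_VARS = ("LD_LIBRARY_PATH", "ORACLE_HOME", "TNS_ADMIN")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        # Report which variables the cron environment did not provide
        print("Missing environment variables: " + ", ".join(missing))
```

Running this from the same wrapper script that cron calls shows quickly whether sourcing .bash_profile actually made the variables visible.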

Thursday, September 29, 2011

Django, Apache 2 and mod_wsgi

A pain in the neck to set up, but nice to have.

1. Follow installation instructions for mod_wsgi and django
2. Add to httpd.conf

# ----------- Django / WSGI Configuration ----------
# WSGIDaemonProcess takes a process group name as its first argument ("mgrm" is assumed here)
WSGIDaemonProcess mgrm processes=2 threads=15

WSGIScriptAlias /mgrm "/home/django/var/www/mgrm/apache/django.wsgi"
<Directory "/home/django/var/www/mgrm/">
Order allow,deny
Allow from all
</Directory>

Alias "/static/admin" "/opt/python2.7/lib/python2.7/site-packages/django/contrib/admin/media/"
<Directory "/opt/python2.7/lib/python2.7/site-packages/django/contrib/admin/media/">
Order allow,deny
Allow from all
</Directory>

*Note that the alias for "/static/admin" must match whatever alias and path are set in the main settings.py file in your Django project directory.

3. Create a django.wsgi file at the path you used in WSGIScriptAlias, containing the following:

import os
import sys

# Option one
#sys.path.append('/home/django/var/www')
#sys.path.append('/home/django/var/www/mgrm')
#os.environ['DJANGO_SETTINGS_MODULE'] = 'mgrm.settings'
#os.environ['DJANGO_ENV'] = 'PRODUCTION'

# Option two
# from http://blog.dscpl.com.au/2010/03/improved-wsgi-script-for-use-with.html
sys.path.insert(0,'/home/django/var/www/mgrm')
import settings
import django.core.management
django.core.management.setup_environ(settings)
utility = django.core.management.ManagementUtility()
command = utility.fetch_command('runserver')
command.validate()
import django.conf
import django.utils
django.utils.translation.activate(django.conf.settings.LANGUAGE_CODE)

# Common to both options
import django.core.handlers.wsgi

application = django.core.handlers.wsgi.WSGIHandler()

*Note that this file goes inside your Django project folder, preferably in an apache subfolder.

4. Edit the main urls.py so that it works with both the Django development server and Apache, like this:

urlpatterns = patterns('',

# mod_wsgi does NOT pass the '/mgrm' mount point to this application. However,
# the django development server does. So in order to get these urls.py to
# work correctly with both, I created a match group that doesn't create a
# back reference. That match group is this: (?:mgrm/)?
url(r'^(?:mgrm/)?polls/', include('polls.urls')),

# Admin sites are doing some reverse url lookup, and the match group trick
# doesn't work with them. To resolve this issue we create two references:
# one for mod_wsgi, and the other for the development server.
url(r'^admin/', include(admin.site.urls)),
url(r'^mgrm/admin/', include(admin.site.urls)),
)

5. You may need to edit your templates to add the application name to each URL in the template.
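
The optional, non-capturing group used in the polls entry of urls.py can be tried on its own with Python's re module. This is just a sketch to show the regex behaviour outside Django; the name POLLS_PATTERN and the sample paths are made up.

```python
import re

# Same pattern as the polls entry in urls.py: "mgrm/" is optional and,
# because of (?: ... ), it is not captured as a group.
POLLS_PATTERN = re.compile(r'^(?:mgrm/)?polls/')

for path in ('polls/', 'mgrm/polls/', 'other/polls/'):
    match = POLLS_PATTERN.match(path)
    print(path, bool(match), match.groups() if match else None)
```

The first two paths match and produce no captured groups, so no extra argument would be passed along to the view; the third does not match at all.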

Wednesday, September 14, 2011

How to get Django to see multiple PostgreSQL schemas

Took a while to figure this out, so here goes.

First create a PostgreSQL user that will be used by Django to connect to the database. This is the user that will be included in the settings.py file for the database connection section.

Log into PostgreSQL as admin/superuser and issue the following command:

GRANT USAGE ON SCHEMA foo TO django_user;

(Or GRANT USAGE to any role which has django_user as a (direct or indirect) member.)
(Or GRANT ALL ... if that is what you want.)

The next step is to change the default schema search path. To make a permanent change, do the following:

ALTER ROLE django_user SET SEARCH_PATH to "$user",public,your_schema;

Log out and log back in for the change to take effect. You can test the outcome by running \dt in psql; you should see all tables from all schemas that the role has been granted access to.

You can now run manage.py inspectdb and it will see all tables in all schemas. I don't know yet how it will treat tables with the same name in different schemas, since it is no longer necessary to prefix the schema name in a query, although that can still be done.

Friday, July 29, 2011

ClustalW2 Command line Arguments

CLUSTAL 2.0.12 Multiple Sequence Alignments

DATA (sequences)

-INFILE=file.ext :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).

VERBS (do things)

-OPTIONS :list the command line parameters
-HELP or -CHECK :outline the command line params.
-FULLHELP :output full help content.
-ALIGN :do full multiple alignment.
-TREE :calculate NJ tree.
-PIM :output percent identity matrix (while calculating the tree)
-BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT :output the input sequences in a different file format.

PARAMETERS (set things)

***General settings:***
-INTERACTIVE :read command line, then enter normal interactive menus
-QUICKTREE :use FAST algorithm for the alignment guide tree
-TYPE= :PROTEIN or DNA sequences
-NEGATIVE :protein alignment with negative values in matrix
-OUTFILE= :sequence alignment file name
-OUTPUT= :GCG, GDE, PHYLIP, PIR or NEXUS
-OUTORDER= :INPUT or ALIGNED
-CASE :LOWER or UPPER (for GDE output only)
-SEQNOS= :OFF or ON (for Clustal output only)
-SEQNO_RANGE= :OFF or ON (NEW: for all output formats)
-RANGE=m,n :sequence range to write starting m to m+n
-MAXSEQLEN=n :maximum allowed input sequence length
-QUIET :Reduce console output to minimum
-STATS= :Log some alignment statistics to file

***Fast Pairwise Alignments:***
-KTUPLE=n :word size
-TOPDIAGS=n :number of best diags.
-WINDOW=n :window around best diags.
-PAIRGAP=n :gap penalty
-SCORE :PERCENT or ABSOLUTE

***Slow Pairwise Alignments:***
-PWMATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
-PWGAPOPEN=f :gap opening penalty
-PWGAPEXT=f :gap extension penalty

***Multiple Alignments:***
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree
-MATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
-GAPOPEN=f :gap opening penalty
-GAPEXT=f :gap extension penalty
-ENDGAPS :no end gap separation pen.
-GAPDIST=n :gap separation pen. range
-NOPGAP :residue-specific gaps off
-NOHGAP :hydrophilic gaps off
-HGAPRESIDUES= :list hydrophilic res.
-MAXDIV=n :% ident. for delay
-TYPE= :PROTEIN or DNA
-TRANSWEIGHT=f :transitions weighting
-ITERATION= :NONE or TREE or ALIGNMENT
-NUMITER=n :maximum number of iterations to perform
-NOWEIGHTS :disable sequence weighting

***Profile Alignments:***
-PROFILE :Merge two alignments by profile alignment
-NEWTREE1= :file for new guide tree for profile1
-NEWTREE2= :file for new guide tree for profile2
-USETREE1= :file for old guide tree for profile1
-USETREE2= :file for old guide tree for profile2

***Sequence to Profile Alignments:***
-SEQUENCES :Sequentially add profile2 sequences to profile1 alignment
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree

***Structure Alignments:***
-NOSECSTR1 :do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2 :do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT=STRUCTURE or MASK or BOTH or NONE :output in alignment file
-HELIXGAP=n :gap penalty for helix core residues
-STRANDGAP=n :gap penalty for strand core residues
-LOOPGAP=n :gap penalty for loop regions
-TERMINALGAP=n :gap penalty for structure termini
-HELIXENDIN=n :number of residues inside helix to be treated as terminal
-HELIXENDOUT=n :number of residues outside helix to be treated as terminal
-STRANDENDIN=n :number of residues inside strand to be treated as terminal
-STRANDENDOUT=n :number of residues outside strand to be treated as terminal

***Trees:***
-OUTPUTTREE=nj OR phylip OR dist OR nexus
-SEED=n :seed number for bootstraps.
-KIMURA :use Kimura's correction.
-TOSSGAPS :ignore positions with gaps.
-BOOTLABELS=node OR branch :position of bootstrap values in tree display
-CLUSTERING= :NJ or UPGMA

>> HELP 0 << Help for tree output format options

Four output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances
4) Nexus

None of these formats displays the results graphically. Many packages can
display trees in the PHYLIP format 2) below. It can also be imported into
the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display.
NEXUS format trees can be read by PAUP and MacClade.

1) Clustal format output.
This format is verbose and lists all of the distances between the sequences and
the number of alignment positions used for each. The tree is described at the
end of the file. It lists the sequences that are joined at each alignment step
and the branch lengths. After two sequences are joined, it is referred to later
as a NODE. The number of a NODE is the number of the lowest sequence in that
NODE.

2) Phylip format output.
This format is the New Hampshire format, used by many phylogenetic analysis
packages. It consists of a series of nested parentheses, describing the
branching order, with the sequence names and branch lengths. It can be used by
the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the
trees graphically. This is the same format used during multiple alignment for
the guide trees.

Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other
packages that can read and display New Hampshire format are TreeView (Mac/PC),
TreeTool (UNIX), and Phylowin.

3) The distances only.
This format just outputs a matrix of all the pairwise distances in a format
that can be used by the Phylip package. It used to be useful when one could not
produce distances from protein sequences in the Phylip package but is now
redundant (Protdist of Phylip 3.5 now does this).

4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
including PAUP and MacClade. The format is described fully in:
Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.

5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
By default, the bootstrap values are placed on the nodes of the phylip format
output tree. This is inaccurate as the bootstrap values should be associated
with the tree branches and not the nodes. However, this format can be read and
displayed by TreeTool, TreeView and Phylowin. An option is available to
correctly place the bootstrap values on the branches with which they are
associated.
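
Since this blog is mostly about Python, here is how a few of the flags listed above could be assembled into a command from a script. A minimal sketch, assuming the binary is installed as clustalw2 and is on the PATH; the function name and the file names are made up.

```python
def build_clustalw_command(infile, outfile, output_format="PHYLIP",
                           executable="clustalw2"):
    """Build an argument list for a full multiple alignment."""
    return [
        executable,
        "-INFILE=" + infile,        # input sequences
        "-ALIGN",                   # do full multiple alignment
        "-OUTFILE=" + outfile,      # sequence alignment file name
        "-OUTPUT=" + output_format, # GCG, GDE, PHYLIP, PIR or NEXUS
    ]


if __name__ == "__main__":
    cmd = build_clustalw_command("seqs.fasta", "seqs.phy")
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.call(cmd)
```

Keeping the command as a list rather than one string means it can be handed straight to subprocess without going through a shell.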

Setting up X Windows on Mac

For Snow Leopard.

First check /usr/etc/sshd_config and make sure that "X11Forwarding yes" has been set.

Then log in to the remote server with ssh -X user@remote.server

Start the remote desktop (e.g. GNOME) with gnome-session

Tuesday, March 29, 2011

Adding PHP module to default OS 10.6.1 PHP stack

The current system is Snow Leopard 10.6.1 and I want to add PostgreSQL support to the default PHP installation. Snow Leopard comes with PHP 5.3.4 already installed in Apple's weird, distributed way. However, the current distro for PHP is 5.3.6 at the time of this writing, so what to do? I found the solution scattered across many different blogs, so I am synthesizing it here. None of this was my own creation.

First, grab a copy of the source code that matches what is already installed. Probably won't find it on PHP.net, so try this link: php-5.3.4

I created a /src directory to store source code in. Copy the tar file into here or a similar directory and unpack it.

Change to that directory:

>cd /src/php-5.3.4

Set some environment variables before doing the configuration

>export MACOSX_DEPLOYMENT_TARGET=10.6.7
>export CFLAGS="-arch x86_64"
>export CXXFLAGS="-arch x86_64"
>export LDFLAGS="-arch x86_64"

Go to the pgsql source directory in php ext folder

>cd ext/pgsql

Compile the extension module

>phpize
>./configure
>make

The extension will be found here

>cd /src/php-5.3.4/ext/pgsql/.libs/
>ls
-rwxr-xr-x 1 Bali admin 154K Mar 29 12:41 pgsql.so

Copy the extension to the extensions library and make sure it is executable

>sudo cp pgsql.so /usr/lib/php/extensions/no-debug-non-zts-20090626/
>cd /usr/lib/php/extensions/no-debug-non-zts-20090626/
>sudo chmod +x pgsql.so

Create a copy of the php.ini file if one does not already exist

>sudo cp /etc/php.ini.default /etc/php.ini

Edit the php.ini file and add the following two lines:

extension_dir="/usr/lib/php/extensions/no-debug-non-zts-20090626/"
extension=pgsql.so

Save and then test that the extension is loaded properly by running the following at the command line:

>php -m

You should see a list of installed modules, including pgsql. Then go back and restart Apache:

>/usr/sbin/apachectl graceful

Run phpinfo to verify the module has been loaded. You may have to scroll down to see it.

That is it.

Friday, February 4, 2011

Installing libsvm-3.0 for Python on OSX 10.6

This was very frustrating to solve and I eventually had to get my friend Kieran (thanks Kieran!) to help me. Basically the vanilla installation instructions that come with it are insufficient (at least for me) to get a working module. Here are the steps that were required to get everything to work.

1. Run 'make' in the libsvm-3.0 directory

2. Run 'make' in the libsvm-3.0/python directory

In the libsvm-3.0 directory there should now be a libsvm.so.2 file

3. Create a new directory in your site-packages directory (your pythonpath) called libsvm

4. Copy the libsvm.so.2 file from libsvm-3.0 and the svm.py, svm.pyc, svmutil.py files from libsvm-3.0/python to site-packages/libsvm

There are a couple things that are missing so now we need to make them.

5. In site-packages/libsvm create a file called __init__.py. This is an empty file, but it is necessary to get the directory recognized as a python module.

6. Edit svm.py and add the following two lines after the other import statements at the top of the file:

   import os.path
   _PATH = os.path.dirname(os.path.abspath(__file__))

7. At around line 7 you will see this statement

   # For unix the prefix 'lib' is not considered.

   if find_library('svm'):
       libsvm = CDLL(find_library('svm'))
   elif find_library('libsvm'):
       libsvm = CDLL(find_library('libsvm'))
   else:
       if sys.platform == 'win32':
           libsvm = CDLL('../windows/libsvm.dll')
       else:
           libsvm = CDLL('../libsvm.so.2')

8. Change this to look like this:

   # For unix the prefix 'lib' is not considered.
   if find_library('svm'):
       libsvm = CDLL(find_library('svm'))
   elif find_library('libsvm'):
       libsvm = CDLL(find_library('libsvm'))
   else:
       if sys.platform == 'win32':
           libsvm = CDLL(os.path.join(_PATH, 'windows', 'libsvm.dll'))
       else:
           libsvm = CDLL(os.path.join(_PATH, 'libsvm.so.2'))

9. Once you save svm.py, you should be able to fire up a python interpreter and do 'from libsvm import svm'. If that works, and a dir(svm) shows you a ton of functions, then you are good to go.

Wednesday, January 19, 2011

Generate all possible proteins from ambiguous DNA

This had me stumped for a while, but this works pretty well. It does NOT handle stop codons or gap characters like '-'. Requires BioPython.

import itertools
from Bio.Seq import Seq
from Bio.Data import CodonTable
from Bio.Data import IUPACData

# Takes Bio.Seq.Seq object as input
# Returns list of all possible proteins
# Assumes sequence is in frame +1
def generateProtFromAmbiguousDNA(s):
   std_nt = CodonTable.unambiguous_dna_by_name["Standard"]
   nonstd = IUPACData.ambiguous_dna_values
   aa_trans = []
   for i in range(0,len(s),3):
      codon = str(s)[i:i+3]  # str(s) replaces the older s.tostring()
      aa = CodonTable.list_possible_proteins(codon,std_nt.forward_table,nonstd) 
      aa_trans.append(aa)
   proteins = list(itertools.product(*aa_trans))
   possible_proteins = []
   for x in proteins:
      possible_proteins.append("".join(x))
   return possible_proteins

def main():
   a = Seq('ATGGCARTTGTAHAC')
   print("DNA: " + str(a))
   print("Proteins:")
   foo = generateProtFromAmbiguousDNA(a)
   for s in foo:
      print(s)

if __name__ == '__main__':
   main()
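
To see what is going on for a single codon without BioPython, here is a small sketch of the expansion step. It only illustrates the idea and is not BioPython's implementation; the names IUPAC_DNA and expand_codon are made up, and the dictionary is the standard IUPAC ambiguity code table.

```python
import itertools

# IUPAC ambiguity codes for DNA: each letter maps to the bases it can stand for.
IUPAC_DNA = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def expand_codon(codon):
    """Return every unambiguous codon an ambiguous codon can represent."""
    choices = [IUPAC_DNA[base] for base in codon.upper()]
    return ["".join(bases) for bases in itertools.product(*choices)]


print(expand_codon('RTT'))  # ['ATT', 'GTT']
print(expand_codon('HAC'))  # ['AAC', 'CAC', 'TAC']
```

These are the two ambiguous codons in the example sequence above, which is why it yields 2 x 3 = 6 possible proteins.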

Creating a quick codon table

I didn't think this up; the code comes from Peter Collingridge here. But it is rather elegant.


bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = "F F L L S S S S Y Y stop stop C C stop W L L L L P P P P H H Q Q R R R R I I I M T T T T N N K K S S R R V V V V A A A A D D E E G G G G".split(' ')
codon_table = dict(zip(codons, amino_acids))
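
A quick way to put the table to use is a tiny translate function. This is my own sketch on top of the table, not part of the original code: the translate name and the choice to stop at the first stop codon are assumptions. The table is rebuilt here so the snippet runs on its own.

```python
bases = ['t', 'c', 'a', 'g']
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = ("F F L L S S S S Y Y stop stop C C stop W "
               "L L L L P P P P H H Q Q R R R R "
               "I I I M T T T T N N K K S S R R "
               "V V V V A A A A D D E E G G G G").split(' ')
codon_table = dict(zip(codons, amino_acids))


def translate(dna):
    """Translate a DNA string in frame +1, stopping at the first stop codon."""
    protein = []
    # Ignore any trailing bases that do not make up a full codon
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = codon_table[dna[i:i + 3].lower()]
        if residue == 'stop':
            break
        protein.append(residue)
    return "".join(protein)


print(translate('ATGGCAATTGTACAC'))  # MAIVH
```

The table only knows the four unambiguous bases in lowercase, so the input is lowercased and anything else (ambiguity codes, gaps) raises a KeyError.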

Thursday, January 13, 2011

Update the locate database on the Mac

This is the command for updating the locate database on the OSX system.

sudo /usr/libexec/locate.updatedb

I should figure out how to make this run every day.

Wednesday, January 5, 2011

Connecting to PostgreSQL with Python and Psycopg2

Basic syntax for making a database connection, executing and retrieving data:

import psycopg2 as pg

# create database connection
try:
   conn = pg.connect("dbname='template1' user='dbuser' host='localhost' password='dbpass'")
except pg.Error:
   print("Unable to connect to database")
   raise  # without a connection the code below cannot run


# create database cursor
cur = conn.cursor()


# execute SQL and fetch results
cur.execute("""SELECT datname from pg_database""")
rows = cur.fetchall()


print("\nShow database results:\n")
for row in rows:
   print(row[0])

# release the cursor and the connection
cur.close()
conn.close()