Wednesday, May 2, 2012

Matching limited elements using Postgres IN clause

So you have two tables, one a sequence table holds unique sequence ids.  The other table hold sequence mutations and has a 1 to many relationship with the sequence table.  So for a given sequence id in the mutation table, there are many records, one for each mutation.

Now you have two sets of mutations you are interested in, a primary set A and a secondary set B.  You want to know how many sequences contain 1 and only 1 mutation from set A and 1 and only 1 from set B.


select count(*) from (
SELECT distinct a.seq_id
FROM sequence a
JOIN seq_mutations b ON b.seq_id = a.seq_id
JOIN seq_mutations c ON c.seq_id = a.seq_id
WHERE b.mutation IN ('mutA','mutB','mutC')
AND c.mutation IN ('mut1', 'mut2', 'mut3', 'mut4', 'mut5')
GROUP BY a.seq_id
HAVING 
  count(distinct b.mutation) = 1 
  AND count(distinct c.mutation) = 1) a

This query gives you exact, fine control over the conditions you want to test for.  If you then want to test for 1 and only 1 mutation in set A and 2 and only 2 mutations in set B, just change

count(distinct c.mutation) = 2

in the HAVING clause.  This also lets you use > and <, so you can ask for 3 or more mutations by using

count(distinct c.mutation) >= 3

You can also vary the count for b.mutation to adjust the number of mutations allowed from set A, so any combination can now be extracted.

Thanks to Jake for the code help.