WTCCC Dataset Decryption -- my own solution


The Wellcome Trust Case Control Consortium (WTCCC) is a collaboration of 50 research groups across the UK, established in 2005. The WTCCC's aims were to exploit progress in understanding patterns of human genome sequence variation, along with advances in high-throughput genotyping technologies, and to explore the utility, design and analysis of genome-wide association (GWA) studies. The WTCCC has substantially increased the number of genes known to play a role in the development of some of our most common diseases and has to date identified approximately 90 new variants across all of the diseases analysed. As well as confirming many of the known associations, some 28 in total, the WTCCC has also identified many novel variants that affect susceptibility to disease.

Officially, the decryption software is written in Java and distributed as a .jar file. Because WTCCC is a fairly large dataset and its file format is not friendly to Windows, I chose to decrypt it on Linux, specifically Ubuntu 14.04 x64. My hardware is a desktop rig with an i7-4790 @ 3.60GHz and 8GB of DDR3 memory.

Speaking of decryption, there are two ways to decrypt a file. The first is to go into the decryption software's own console and type many commands, just to decrypt one file! You can imagine how arduous that is for a single file, not to mention that there are thousands of them. And make no mistake, that assumes ZERO human error, which we all know is next to impossible over that many repetitions.

Fortunately, the software also offers a way to call the .jar file and pass arguments to it from a shell, which makes shell script automation an option. The core mechanism is still decrypting one file at a time, but with a loop in a shell script the computer decrypts all the files one by one. That removes the time cost of manually switching files as well as the risk of human error, since the same command is executed over and over with only the file name changing.

Without further ado, here is my code.


        # The command is assembled as: cmd1 + file name + cmd2
        cmd1="java -jar softwareName.jar -pf login.txt -dc /[path to WTCCC]/[string-partial-omitted]/PART_01/"
        cmd2=" -dck [decryption key]"
        # To omit files you don't want to process, filter the listing with
        # [ls | grep -v -e "filename1" -e "filename2"] or just [ls | grep -v "filename"].
        for filename in $(ls /[path to WTCCC]/PART_01)
        do
            cmd=${cmd1}${filename}${cmd2}
            #echo "$cmd"    # uncomment to preview the command
            eval "$cmd"     # run the decryption for this one file
        done

login.txt contains my username and password for decrypting WTCCC files: the first line is the username and the second line is the password.
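Since login.txt holds a password in plain text, it may be worth creating it with owner-only permissions. Below is a minimal sketch of one way to do that; `write_login_file` is a hypothetical helper of my own naming, not something shipped with the decryption software, and the values in the example call are placeholders.

```shell
# Sketch: write the two-line credentials file (username, then password)
# and restrict it to the current user. write_login_file is a hypothetical
# helper, not part of the WTCCC decryption tool.
write_login_file() {
    local path="$1" user="$2" pass="$3"
    (
        umask 077                                   # a newly created file is owner-only
        printf '%s\n%s\n' "$user" "$pass" > "$path"
    )
    chmod 600 "$path"                               # also tighten a pre-existing file
}

# Example call with placeholder values:
# write_login_file login.txt "my_username" "my_password"
```

Typing the password as a command argument leaves it in the shell history, so creating the file in a text editor and then running `chmod 600 login.txt` is a reasonable alternative.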

For multiple datasets, I used a simple approach: I duplicated the code for decrypting one dataset, changed the path in each copy to point to a different dataset, and combined the copies into one .sh file. By executing this one file, I can decrypt multiple datasets.
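The duplication could also be replaced by a small function that is called once per dataset folder. The following is only a sketch, not what I actually ran. It reuses the `-pf`, `-dc` and `-dck` arguments from my script above, while `decrypt_dir`, `DRY_RUN`, `JAR`, `LOGIN_FILE` and `DECRYPTION_KEY` are names I made up for illustration. It also assumes the same decryption key applies to every folder passed in, which may not hold for your datasets. By default it only prints the commands, so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Sketch: decrypt several dataset folders without duplicating the loop.
# All variable and function names here are hypothetical placeholders.

JAR="${JAR:-softwareName.jar}"             # the decryption tool's .jar file
LOGIN_FILE="${LOGIN_FILE:-login.txt}"      # username on line 1, password on line 2
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands only, 0 = execute them
DECRYPTION_KEY="${DECRYPTION_KEY:-[decryption key]}"

# decrypt_dir DIR KEY: handle every regular file directly inside DIR.
decrypt_dir() {
    local dir="$1" key="$2" file
    for file in "$dir"/*; do
        [ -f "$file" ] || continue         # skip subdirectories and unmatched globs
        if [ "$DRY_RUN" = "1" ]; then
            printf '%s\n' "java -jar $JAR -pf $LOGIN_FILE -dc $file -dck $key"
        else
            java -jar "$JAR" -pf "$LOGIN_FILE" -dc "$file" -dck "$key"
        fi
    done
}

# Each command-line argument is treated as one dataset folder.
for dataset_dir in "$@"; do
    decrypt_dir "$dataset_dir" "$DECRYPTION_KEY"
done
```

Saved as, say, decrypt_all.sh, it would be called with the dataset folders as arguments, first as a dry run and then with `DRY_RUN=0` once the printed commands look right. Globbing `"$dir"/*` instead of parsing `ls` output also keeps file names with spaces intact.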