Explosive growth in the amount of genomic data is matched by increasing power of consumer grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.
http://ift.tt/2HtdpdY
Δεν υπάρχουν σχόλια:
Δημοσίευση σχολίου
Σημείωση: Μόνο ένα μέλος αυτού του ιστολογίου μπορεί να αναρτήσει σχόλιο.