Accelerating imputation of missing genotypes using parallel computing
Owing to massive jumps in DNA technology, large-scale genomic datasets containing valuable information have become available. While this is a prodigious opportunity, it is also a major challenge, because analysing such large datasets with current computers and software tools is difficult and may take days or even weeks. Novel approaches such as parallel computing have been suggested to deal with these datasets. Here, the effect of parallel computing on the performance of the random forest (RF) algorithm for imputation of missing genotypes was studied. To this end, genotypic matrices were simulated with 500, 1000, 2000, and 3000 single-nucleotide polymorphisms (SNPs) for 500, 1000, and 2000 individuals. Then, 50% of the genotypic information was masked and imputed by RF. The percentage of genotypes correctly imputed was used to measure the accuracy of genotype imputation. Both serial and parallel computing were applied to the data. Parallel computing did not affect the accuracy of imputation: the accuracy was identical in both scenarios. Regarding computational time, however, parallel computing accelerated the analyses markedly, reducing running time by up to 63%. This was because serial computing used only 10% of the processing power of the machine's central processing unit (CPU) for RF, whereas parallel computing utilized 55%. Therefore, as parallel computing significantly reduces computing time without affecting the accuracy of the results, this approach should be exploited by researchers analysing large genomic datasets.
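The workflow described in the abstract (simulate genotypes, mask 50%, impute with RF, measure per-cent correctly imputed, compare serial with parallel runs) can be sketched as follows. This is a minimal illustration, not the authors' code: the matrix sizes, the column-by-column imputation strategy, and the use of scikit-learn's `RandomForestClassifier` with its `n_jobs` parameter for parallel tree building are all our assumptions.

```python
# Hypothetical sketch of RF genotype imputation; n_jobs=1 is serial,
# n_jobs=-1 spreads tree construction across all CPU cores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def simulate_genotypes(n_ind, n_snp):
    """Simulate an individuals x SNPs matrix of 0/1/2 genotype codes."""
    freqs = rng.uniform(0.1, 0.9, size=n_snp)  # per-SNP allele frequencies
    return rng.binomial(2, freqs, size=(n_ind, n_snp)).astype(float)

def rf_impute(X_obs, n_jobs=1):
    """Impute NaN entries SNP-by-SNP with a random forest.

    Only observed genotypes are used for training; missing entries in the
    feature columns are pre-filled with the per-SNP mode so the other SNPs
    can serve as predictors.
    """
    mask = np.isnan(X_obs)
    X_imp = X_obs.copy()
    for j in range(X_obs.shape[1]):            # crude initial mode fill
        col = X_obs[~mask[:, j], j]
        vals, counts = np.unique(col, return_counts=True)
        X_imp[mask[:, j], j] = vals[np.argmax(counts)]
    for j in range(X_obs.shape[1]):            # refine each SNP with an RF
        miss = mask[:, j]
        if not miss.any() or miss.all():
            continue
        feats = np.delete(X_imp, j, axis=1)    # all other SNPs as features
        rf = RandomForestClassifier(n_estimators=50, n_jobs=n_jobs,
                                    random_state=0)
        rf.fit(feats[~miss], X_obs[~miss, j].astype(int))
        X_imp[miss, j] = rf.predict(feats[miss])
    return X_imp

X = simulate_genotypes(120, 30)                # small toy dimensions
mask = rng.random(X.shape) < 0.5               # mask 50% of genotypes
X_obs = np.where(mask, np.nan, X)
X_hat = rf_impute(X_obs, n_jobs=-1)            # parallel run
accuracy = np.mean(X_hat[mask] == X[mask])     # per cent correctly imputed
print(f"imputation accuracy: {accuracy:.2f}")
```

Timing the `rf_impute` call once with `n_jobs=1` and once with `n_jobs=-1` reproduces the serial-versus-parallel comparison; the imputed values, and hence the accuracy, are unaffected by the degree of parallelism because each tree is built identically regardless of which core builds it.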
Volume 102, 2023