Big Data in Bioinformatics
Bioinformatics is one of the sciences that has had to confront massive volumes of data in many of its areas. As research in the life sciences becomes increasingly dependent on laboratory data, bioinformatics seeks to integrate closely related aspects of a field in order to understand the mechanism of a phenomenon as a whole, and to interpret the results and make them accessible for practical and research use. Integrating these data leaves us with a very large volume of data, and that volume keeps growing over time.
The term “big data” refers to very large data sets, structured or unstructured and homogeneous or heterogeneous, that can be analyzed computationally to reveal patterns, trends, and relationships.
Bioinformatics began more than 50 years ago. Its foundations were laid in the early 1960s, when computational methods were first applied to protein sequence analysis. That work built on protein sequencing research in the 1950s, which produced the first reported protein sequence, that of bovine insulin. About a decade later, in 1965, Dayhoff compiled the Atlas of Protein Sequence and Structure, generally regarded as the first bioinformatics database.
Since then, more databases have gradually been created, and as the data grew, life science researchers turned to advanced technologies for data analysis. Many platforms with different applications, such as complete sequencing of multiple genomes, gene expression profiling, and the study of epigenetic changes and mutations, continuously add to these data. Because bioinformatics data are usually widely scattered and stored in heterogeneous formats, many different tools have been developed to analyze them. The growth rate of bioinformatics data is also very high; for example, the total amount of sequence data generated doubles approximately every seven months. BigFiRSt and Sequence Scanner are two programs used to process and analyze sequence data.
Another challenge facing bioinformatics researchers is the analysis of very large biological networks. Various tools are available for this as well: in addition to R software packages, examples include Cytoscape and NetMiner.
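To get a feel for what a seven-month doubling time means, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the doubling time stays constant, which is an idealization, and the function name `growth_factor` and its parameters are hypothetical names chosen for this example.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Return how many times larger the data volume is after `months`,
    assuming (as a simplification) a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Print the multiplication factor after 1, 5, and 10 years.
    for years in (1, 5, 10):
        factor = growth_factor(12 * years)
        print(f"{years:>2} year(s): about {factor:,.1f}x the starting volume")
```

Under this assumption the volume grows roughly 3.3-fold per year, so any fixed storage or analysis capacity is outpaced quickly.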
Progress in big data analysis tools is essential because it lowers the cost of computation and increases the computing power that is effectively available. In addition, biologists no longer depend solely on traditional laboratory work to discover and investigate biological interactions. Instead, they increasingly rely on the massive and constantly growing genomic data made available by various research groups.
Among the important areas of bioinformatics that are highly related to big data are the following:
- Gene expression analysis
- Sequencing
- Determination of protein structure
- Ontology of biological data
To read more, you can refer to the following sources:
- Malviya, R., Sharma, P. K., Sundram, S., Dhanaraj, R. K., & Balusamy, B. (Eds.). (2022). Bioinformatics Tools and Big Data Analytics for Patient Care. CRC Press.
- Branco, I., & Choupina, A. (2021). Bioinformatics: new tools and applications in life science and personalized medicine. Applied Microbiology and Biotechnology, 105(3), 937-951.
- Gauthier, J., Vincent, A. T., Charette, S. J., & Derome, N. (2019). A brief history of bioinformatics. Briefings in Bioinformatics, 20(6), 1981-1996.
- Pal, S., Mondal, S., Das, G., Khatua, S., & Ghosh, Z. (2020). Big data in biology: The hope and present-day challenges in it. Gene Reports, 21, 100869.
- Kashyap, H., Ahmed, H. A., Hoque, N., Roy, S., & Bhattacharyya, D. K. (2015). Big data analytics in bioinformatics: A machine learning perspective. arXiv preprint arXiv:1506.05101.