Protein Folding and Machine Learning: Fundamentals


Abstract

In spite of decades of research, much remains to be discovered about folding: the detailed structure of the initial (unfolded) state, vestigial folding instructions remaining only in the unfolded state, the interaction of the molecule with the solvent, the instantaneous power at each point within the molecule during folding, the fact that the process is stable in spite of myriad possible disturbances, potential stabilization of the trajectory by chaos, and, of course, the exact physical mechanism (code or instructions) by which the folding process is specified in the amino acid sequence. Simulations based upon microscopic physics have had some spectacular successes and continue to improve, particularly as supercomputer capabilities increase. The simulations, exciting as they are, are still too slow and expensive to deal with the enormous number of molecules of interest. In this paper, we introduce an approximate model, based upon physics, empirics, and information science, intended for machine learning applications in which very large numbers of sub-simulations must be made. In particular, we focus upon the learning phase of machine learning applications and argue that our model is sufficiently close to the physics that, in spite of its approximate nature, it can facilitate stepping through machine learning solutions to explore the mechanics of folding mentioned above. We particularly emphasize, as attractive targets for such machine learning analysis, the exploration of energy flow (power) within the molecule during folding, the possibility of energy scale invariance (above a threshold), vestigial information in the unfolded state, and statistical analysis of an ensemble of folding micro-steps.
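To make the quantities named above concrete, the sketch below illustrates one way instantaneous per-atom power and its statistics over an ensemble of folding micro-steps might be computed from trajectory data. It is a minimal illustration only, not the model proposed in the paper: the array names, shapes, finite-difference velocity estimate, and toy data are assumptions introduced here for demonstration.

```python
# Illustrative sketch (not the authors' model): per-atom instantaneous
# power P_i(t) = F_i(t) . v_i(t) along a folding trajectory, plus simple
# ensemble statistics pooled over micro-steps. Shapes and names are
# hypothetical, chosen only to make the quantities concrete.
import numpy as np

def instantaneous_power(positions, forces, dt):
    """Per-atom power for each micro-step of one trajectory.

    positions : (T, N, 3) atom coordinates at T saved frames
    forces    : (T, N, 3) forces on each atom at the same frames
    dt        : time between frames

    Returns an array of shape (T - 1, N): estimated energy flow into or
    out of each atom per unit time, using finite-difference velocities
    and midpoint-averaged forces.
    """
    velocities = np.diff(positions, axis=0) / dt           # (T-1, N, 3)
    mid_forces = 0.5 * (forces[:-1] + forces[1:])          # (T-1, N, 3)
    return np.einsum("tni,tni->tn", mid_forces, velocities)

def ensemble_power_stats(trajectories, dt):
    """Mean and standard deviation of per-atom power across an ensemble
    of (positions, forces) trajectory pairs, pooled over micro-steps."""
    powers = [instantaneous_power(p, f, dt) for p, f in trajectories]
    pooled = np.concatenate(powers, axis=0)                # (sum of T-1, N)
    return pooled.mean(axis=0), pooled.std(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in data: 3 short trajectories, 50 frames, 20 atoms each.
    trajs = [(rng.normal(size=(50, 20, 3)), rng.normal(size=(50, 20, 3)))
             for _ in range(3)]
    mean_p, std_p = ensemble_power_stats(trajs, dt=0.002)
    print(mean_p.shape, std_p.shape)   # (20,), (20,)
```

In this kind of analysis, the per-atom power field is what the abstract refers to as energy flow within the molecule, and the pooled statistics are one simple form the statistical analysis of folding micro-steps could take.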
