Karen Hao at The Wall Street Journal: “In their search for new disease-fighting medicines, drug makers have long employed a laborious trial-and-error process to identify the right compounds. But what if artificial intelligence could predict the makeup of a new drug molecule the way Google figures out what you’re searching for, or email programs anticipate your replies, like ‘Got it, thanks’?
That’s the aim of a new approach that uses an AI technique known as natural language processing—the same technology that enables OpenAI’s ChatGPT to generate human-like responses—to analyze and synthesize proteins, which are the building blocks of life and of many drugs. The approach exploits the fact that biological codes have something in common with search queries and email texts: Both are represented by a series of letters.
Proteins are made up of dozens to thousands of small chemical subunits known as amino acids, and scientists use special notation to document the sequences. With each amino acid corresponding to a single letter of the alphabet, proteins are represented as long, sentence-like combinations.
Natural language algorithms, which quickly analyze language and predict the next step in a conversation, can also be applied to this biological data to create protein-language models. The models encode what might be called the grammar of proteins—the rules that govern which amino acid combinations yield specific therapeutic properties—to predict the sequences of letters that could become the basis of new drug molecules. As a result, the time required for the early stages of drug discovery could shrink from years to months.
‘Nature has provided us with tons of examples of proteins that have been designed exquisitely with a variety of functions,’ says Ali Madani, founder of ProFluent Bio, a Berkeley, Calif.-based startup focused on language-based protein design. ‘We’re learning the blueprint from nature.’…(More)”.
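To make the "proteins as sentences" idea in the excerpt concrete, here is a minimal sketch in Python of next-letter prediction over amino-acid sequences. It is only a toy bigram counter, not the method used by any company or model mentioned in the article; real protein-language models are large neural networks. The function names (`is_valid_protein`, `train_bigram`, `predict_next`) and the example sequences are made up for illustration.

```python
from collections import Counter, defaultdict

# The 20 standard amino acids in one-letter notation.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(seq):
    """Check that a non-empty string uses only the 20 standard letters."""
    return len(seq) > 0 and all(letter in AMINO_ACIDS for letter in seq)


def train_bigram(sequences):
    """Count how often each amino-acid letter is followed by each other letter."""
    counts = defaultdict(Counter)
    for seq in sequences:
        if not is_valid_protein(seq):
            raise ValueError(f"not a valid one-letter protein sequence: {seq!r}")
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequently observed follower of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy sequences, invented for this example; not real proteins.
toy_sequences = ["MKTAYIAK", "MKTLLVAG", "MKVAYLAK"]
model = train_bigram(toy_sequences)

print(predict_next(model, "M"))  # "K" follows "M" in all three toy sequences
```

A bigram counter looks only one letter back, so it captures none of the long-range "grammar" the article describes. It is shown here because it is the simplest form of the predict-the-next-letter setup that autocomplete and protein-language models share.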