What is Artificial Intelligence? A Primer for Physicians

AI in Surgery: Part II

Ankoor Talwar
11 min read · Mar 25, 2024

Intelligence is defined as “the ability to acquire and apply knowledge and skills”.6 Artificial intelligence is a field of computer science that focuses on the development of systems that can perform tasks that normally require human intelligence, such as perception, learning, problem-solving, and decision-making. Modern AI models are trained to perform a particular cognitive task very well. In doing so, AI has the potential to improve efficiency, reduce errors, and increase productivity.7

Figure 2. The machine-human interface of narrow AI methods. These methods mimic narrow forms of human intelligence with greater accuracy, speed, and scale, allowing us to use them for complex data.

AI methods, in fields like machine learning (ML), natural language processing (NLP), computer vision (CV), and audio processing (AP), have been developed to mimic components of human intelligence (Figure 2). The advantage of these AI methods over human intelligence is that they can perform their tasks with greater accuracy, speed, and scale. ML is a field of AI that focuses on creating models which learn from data. NLP is a field of AI that focuses on the interaction between computers and humans through natural languages, such as English, Spanish, or Chinese. NLP tasks include text classification, information extraction, translation, summarization, and sentiment analysis. CV is a field of AI that aims to enable computers to understand visual information, including both images and video. CV tasks include image classification, object detection, and facial recognition. AP is a field of AI focused on computers processing, interpreting, and manipulating audio signals. AP tasks include speech recognition, speech synthesis, and music composition.

It is important to understand that these fields of AI can intersect to produce powerful models. Just as a child must learn to understand language, one might use machine learning to “train” a natural language processing model so that it behaves as if it understands language.

The most widely employed narrow AI method in surgery is ML.7 Machine learning algorithms, which teach computers to learn from data, can be described in several ways. Supervised learning describes when machines learn from annotated datasets to make predictions on new datasets. Of note, these annotated datasets are not necessarily manually labelled. Within surgery, supervised ML has been used to predict adverse events following surgery and to measure risk of incisional hernia after abdominal surgery, among several other applications.8,9 ML also includes unsupervised learning, whereby machines identify groups and patterns within unannotated data. Other approaches in ML include reinforcement learning and deep learning. Reinforcement learning is analogous to operant conditioning, where positive and negative rewards offer feedback from which a model continually learns a task.7 Deep learning describes systems which use multi-layered neural networks to understand complex data. The “deepness” refers to the fact that there are several layers in the neural network between the input and output, which allows these networks to process even more complex data.
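
To make the supervised/unsupervised distinction concrete, below is a minimal sketch using scikit-learn. The data, features, and “complication” outcome are entirely synthetic and illustrative; they are not drawn from any of the cited studies.

```python
# A minimal sketch of supervised vs. unsupervised learning using scikit-learn.
# The dataset is synthetic; the features and "complication" label are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic structured data: age, BMI, operative time (minutes)
X = np.column_stack([
    rng.normal(60, 12, 500),   # age
    rng.normal(30, 6, 500),    # BMI
    rng.normal(180, 45, 500),  # operative time
])
# Hypothetical label: postoperative complication (1 = yes), loosely tied to the features
y = (0.02 * X[:, 0] + 0.05 * X[:, 1] + 0.005 * X[:, 2]
     + rng.normal(0, 1, 500) > 3.5).astype(int)

# Supervised learning: learn from labeled cases, then predict on unseen cases
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: find groups in the same data without ever seeing the labels
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Patients per cluster:", np.bincount(clusters))
```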

While deep learning is increasingly utilized in healthcare, it is limited because it can lack “explainability”, the ability for us to understand the decisions an AI model makes.10 The inner layers of a deep learning model can be a “black box” to the teams of developers and surgeons who design them: there may be input variables (i.e. “features”) that a deep learning model relies on but which do not fit a clinical context, thus biasing the model. One way this can happen is if a deep learning model overfits its training data, capturing noise and random fluctuations in the training data instead of the underlying patterns, which prevents it from generalizing to outside testing data. When an unexplainable model analyzes testing data and produces an incorrect output, it is difficult to troubleshoot the error in its architecture. As such, there is a healthy tension in healthcare informatics between using powerful deep learning methods and other ML methods which are more readily explainable. Importantly, none of these descriptions of ML are mutually exclusive. There are algorithms which can incorporate supervised learning with reinforcement learning and deep learning, as we will soon explore.
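
The overfitting problem can be illustrated with a small sketch: an intentionally over-flexible model memorizes a noisy synthetic training set and generalizes poorly, while a simpler, more explainable model (whose coefficients we can inspect) transfers better. The data and models below are illustrative assumptions, not taken from any cited work.

```python
# A minimal sketch of overfitting: an over-flexible model memorizes noise in a small
# training set and fails to generalize, while a simpler model transfers better.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))                                          # 20 noisy features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 200) > 0).astype(int)   # only 2 carry signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# Unconstrained tree: can fit the training data almost perfectly, including its noise
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("Tree  train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))

# Simpler, more explainable model: its coefficients show which features drive predictions
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logit train/test accuracy:", logit.score(X_train, y_train), logit.score(X_test, y_test))
print("Logit coefficients:", logit.coef_.round(2))
```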

When viewed in the context of the data lake, it makes sense that ML algorithms can readily use surface-level, structured data. As we explore deeper levels of the data lake, we encounter unstructured data that is difficult for machines to process. One cannot simply insert an image, for example, into a spreadsheet or calculator. We must find ways to interpret these alternative sources of data.

One approach is to systematically define surface-level, structured features from more complex data. For example, our group has attempted to extract hundreds of numeric features from radiologic images to predict incisional hernia after laparotomy.11,12 While this endeavor has shown potential for future clinical translation, it is cost-intensive and computationally inefficient. This is where deep learning techniques have taken the spotlight, as the inner layers of a deep neural network can potentially process these complex data directly. The challenge is creating the deep neural network in the first place, which requires more data and computing power than traditional ML.10
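
As a toy illustration of the first approach, the sketch below reduces an “image” (here just a random array standing in for a CT slice) to a handful of structured numbers. The features and thresholds are invented for illustration and are not the radiologic features used in the cited hernia studies.

```python
# A toy sketch of turning an unstructured image into structured, surface-level features.
# The "CT slice" is a random array and the features/thresholds are purely illustrative.
import numpy as np

ct_slice = np.random.default_rng(2).normal(40, 200, size=(512, 512))  # fake intensity values

def handcrafted_features(img: np.ndarray) -> dict:
    """Reduce a 2-D image to a few structured numbers a spreadsheet (or ML model) can use."""
    soft_tissue = (img > -100) & (img < 100)                 # crude intensity window
    return {
        "mean_intensity": float(img.mean()),
        "intensity_sd": float(img.std()),
        "soft_tissue_fraction": float(soft_tissue.mean()),
        "rows_with_soft_tissue": int(soft_tissue.any(axis=1).sum()),  # rough spatial extent
    }

print(handcrafted_features(ct_slice))
```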

Traditionally, researchers created ML models for specific tasks from the ground up. This required large training datasets for each model. Over recent years, there has been a paradigm shift in how to build ML models. This shift suggests that one should start with general-purpose systems, termed “foundation models”, and further train them for specific tasks.13 As Bommasani et al. discuss, this evolution allows healthcare teams to “rapidly prototype and build highly dynamic and generative AI-infused applications”.13 In this way, foundation models have the potential to democratize powerful AI techniques for various settings, including surgery.

FOUNDATION MODELS

A foundation model is an ML model that serves as the starting point for further development or improvement. Foundation models are deep neural network architectures pre-trained on extremely large datasets. A foundation model can perform generalized tasks, such as language or image processing, with varying degrees of performance in its original (“pre-trained”) form. However, it can be further trained on more specific datasets to improve performance on downstream applications.13 For example, a foundation model for image classification might be a convolutional neural network (CNN) trained on a large dataset of labeled images. This model could then be used as the basis for more specialized models that perform better on specific types of image classification. In this way, the foundation model can be thought of as a general-purpose technology. Unlike the past generation of ML, in which models needed to be built for each application, foundation models can be rapidly repurposed for many applications. Further training of a foundation model can be regarded as “transfer learning”, as one essentially transfers the network of a base model and builds layers on top of it to create downstream models.13 “Few-shot learning” is a special case of transfer learning, whereby the model is further trained on only a few cases to achieve high accuracy for a given task.14 This contrasts with traditional ML models, which require training on many cases to achieve accuracy on a specific task. Figure 3 demonstrates the differences between traditional supervised ML and few-shot learning using foundation models. “Fine-tuning” is a related concept whereby a foundation model’s existing layers and weights are adjusted during further training to create downstream models.
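
A minimal sketch of this workflow, assuming PyTorch and torchvision are available, is shown below: a CNN pre-trained on ImageNet is frozen, a small new output layer is attached, and only that layer is trained on a handful of placeholder examples. The downstream task and data are hypothetical.

```python
# A minimal sketch of transfer learning: start from a CNN pre-trained on a large general
# dataset (ImageNet), freeze its layers, and train only a small new "head" on a handful
# of task-specific examples. The task, images, and labels here are placeholders.
import torch
import torch.nn as nn
from torchvision import models

# 1. Load a pre-trained, foundation-style model (a ResNet-18 trained on ImageNet)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Freeze the pre-trained layers so their learned features are transferred, not retrained
for param in backbone.parameters():
    param.requires_grad = False

# 3. Replace the final layer with a new head for a hypothetical two-class downstream task
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# 4. "Few-shot" fine-tuning: only the new head's weights are updated, so very few labeled
#    examples are needed. Random tensors stand in for real images here.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 224, 224)          # 8 placeholder images
labels = torch.randint(0, 2, (8,))            # 8 placeholder labels

for _ in range(5):                            # a few quick training steps
    optimizer.zero_grad()
    loss = loss_fn(backbone(images), labels)
    loss.backward()
    optimizer.step()
```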

Figure 3. Differences between traditional supervised machine learning and “few-shot” learning using foundation models. The latter allows development of AI applications by smaller teams, with fewer resources, and with less training data. In this way, foundation models will democratize AI for healthcare applications.

AI adoption in healthcare and surgery has been limited thus far, as compared to other industries, because health data is sensitive and, therefore, restricted in availability. Foundation models align with the current healthcare milieu because they facilitate development of AI models using small datasets (few-shot learning). They allow smaller groups of healthcare innovators to employ powerful AI techniques.

The first foundation models were large language models (LLMs) for NLP tasks. While there were several early pioneers, the first major foundation model was Bidirectional Encoder Representations from Transformers (BERT), developed by Google in 2018.15 This model was pre-trained on the entire English Wikipedia (2.5 billion words at the time) and the BookCorpus (800 million words) dataset. Its base model included 12 layers in its neural network and 110 million parameters. At the time, BERT achieved state-of-the-art benchmarks on many NLP tasks. In fact, in 2019, the Google search engine began using BERT for English search queries.
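
For readers curious what interacting with BERT looks like in practice, the sketch below queries the publicly released bert-base-uncased checkpoint through the Hugging Face transformers library using its masked-word objective. The sentence is illustrative and the exact predictions will vary.

```python
# A minimal sketch of querying the pre-trained BERT model via the Hugging Face
# `transformers` library. BERT was pre-trained with a "masked word" objective, so the
# pipeline below asks it to fill in a blanked-out word. Exact predictions will vary.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The patient was taken to the [MASK] for an emergency laparotomy."):
    print(f"{prediction['token_str']:>15s}  score={prediction['score']:.3f}")
```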

While pre-trained LLMs are somewhat effective at extracting certain information, an LLM may need to be fine-tuned on a relevant dataset to achieve better performance. Researchers have gradually been exploring how LLMs can be further refined for the healthcare setting. For example, BioBERT is a model developed from the BERT architecture.16 It was trained on a large corpus of biomedical literature, and it has been effective at tasks such as extracting information from scientific papers and predicting the functions of genes.16

Because of the sensitive nature of protected health information, development of downstream models trained on clinical documentation has been gradual. Alsentzer et al. were able to further train BERT on clinical documentation from the MIMIC-III v1.4 database, a free critical care database that includes clinical notes, to create ClinicalBERT.17 Several groups have used this publicly available tool to extract indicators of sleep apnea from scanned sleep study reports, detect critical findings on radiology reports, and identify sentences describing procedures in surgical reports.18–21 These investigations found ClinicalBERT to be better at extracting information from free text than traditional NLP algorithms.
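
A sketch of how the public ClinicalBERT weights might be reused is shown below: clinical sentences are converted into numeric embeddings that a small downstream classifier could then learn from. The model identifier is the Hugging Face checkpoint released alongside Alsentzer et al.; the sentences are invented, and this is not the exact pipeline used in the cited studies.

```python
# A sketch of reusing the publicly released ClinicalBERT weights to turn clinical sentences
# into numeric vectors ("embeddings") for a downstream classifier. The model identifier is
# assumed to be the public checkpoint from Alsentzer et al.; sentences are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "emilyalsentzer/Bio_ClinicalBERT"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "There is a large right-sided pleural effusion.",   # hypothetical report sentence
    "No acute cardiopulmonary abnormality.",
]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    outputs = model(**batch)
    # Mean-pool the token vectors into one embedding per sentence
    embeddings = outputs.last_hidden_state.mean(dim=1)

print(embeddings.shape)   # (2, 768): each sentence is now a 768-number vector
```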

Yang et al. compared the performance of ClinicalBERT and other LLM-based models with their own NLP model, GatorTron, which was created using traditional machine learning without transfer learning from foundation models.22 Unlike these other models, GatorTron was created de novo from billions of words of de-identified clinical notes, scientific literature, and Wikipedia. While it demonstrated marginally greater precision and recall on some language tasks, it is worth noting how time- and resource-intensive it was to create GatorTron in the first place, as it required hundreds of dedicated computers and a supercomputing infrastructure provided by NVIDIA.22 Such resources would be a major limitation for most surgeons and research teams. This is in contrast to ClinicalBERT, which was created as a foundation model and is available for public use and further training. It is plausible that fine-tuning BERT, ClinicalBERT, or another foundation model on the training data for GatorTron might create an even superior model, something we will soon discuss.

While BERT and similar models were key drivers for software engineers to democratize ML for industry, they are limited because they can only classify text and cannot generate new text. They are good, for example, at identifying words or phrases, but are unable to holistically understand texts and summarize them. Our current moment in the AI revolution emerged from models which can actually generate new multimedia (e.g., text, images, video, audio). These models have sparked widespread interest and adoption of consumer-friendly AI applications.

Discriminative vs. Generative Models

Before we continue our discussion of foundation models, it is important to understand a key transition in the AI landscape over recent years, from “discriminative” to “generative” models.

Figure 4. Discriminative AI can create boundaries to classify new data points. Generative AI can create new data points which fit a certain class.

All AI models can be categorized as either “discriminative” or “generative” (Figure 4). Discriminative algorithms focus on learning the decision boundaries between different classes of data. These algorithms use existing data to create models which can then predict or classify new data. The simplest discriminative ML algorithm is logistic regression. There are more advanced methods for data that are not linearly separable, such as support vector machines, random forests, and neural networks. To date, most ML applications in surgery have been for discriminative tasks, such as assigning labels or diagnoses to images and encounters. For example, the MySurgeryRisk calculator can classify surgical patients’ risk of a number of postoperative complications with high discriminatory ability.8

On the other hand, generative algorithms, also called “generative AI”, aim to learn the underlying probability distribution of the data. In doing so, they are able to create new data instances. For example, while a discriminative model would learn the boundary between cats and dogs, a generative model more holistically models what a cat is and what a dog is, and can then solve downstream tasks. In healthcare, a discriminative model might be able to process a radiograph and identify whether or not there is a mass, whereas a generative model is needed to create a comprehensive written report of the radiographic findings.
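
The distinction can be made concrete with a toy numeric sketch: a discriminative logistic regression learns only the boundary between two synthetic classes, while a simple generative model (a Gaussian fit to each class) models the classes themselves and can sample brand-new data points. Everything below is synthetic and illustrative.

```python
# A toy sketch of the discriminative/generative distinction on 1-D synthetic data.
# Discriminative: logistic regression learns only the boundary between two classes.
# Generative: fitting a Gaussian to each class models the classes themselves, which
# lets us sample brand-new synthetic data points from either class.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
cats = rng.normal(loc=4.0, scale=1.0, size=200)   # e.g., "cat" feature values
dogs = rng.normal(loc=8.0, scale=1.5, size=200)   # e.g., "dog" feature values
X = np.concatenate([cats, dogs]).reshape(-1, 1)
y = np.array([0] * 200 + [1] * 200)

# Discriminative model: predicts the class of a new point
clf = LogisticRegression().fit(X, y)
print("Predicted class for x=6.5:", clf.predict([[6.5]])[0])

# Generative model: learn each class's distribution, then create new instances
cat_mu, cat_sd = cats.mean(), cats.std()
new_cats = rng.normal(cat_mu, cat_sd, size=5)     # five brand-new "cat-like" samples
print("Synthetic cat-like values:", np.round(new_cats, 2))
```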

Any AI application which produces new multimedia information will use generative AI, for tasks such as summarizing clinical encounters, answering patient questions, translating medical information between languages and literacy levels, and creating technical specifications for custom implants.23 Examples of generative AI algorithms include generative adversarial networks (GANs), autoregressive models, and diffusion models, all of which are types of deep neural networks. Recent healthcare startups have used generative AI to complete clinical documentation of encounters in real time, create synthetic medical imaging, and visualize tissue perfusion intraoperatively.24–26
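
As a minimal, non-clinical illustration of autoregressive generation, the sketch below uses the small, publicly available GPT-2 checkpoint via the Hugging Face transformers library to continue a prompt. The prompt is illustrative, the output will vary from run to run, and this is unrelated to the commercial tools cited above.

```python
# A minimal sketch of autoregressive text generation: the model repeatedly predicts the
# next token, producing new text rather than classifying existing text. Uses the small,
# publicly available GPT-2 checkpoint; the prompt is illustrative and output will vary.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

draft = generator(
    "Operative findings: the gallbladder was",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,
)
print(draft[0]["generated_text"])
```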

Increasingly, generative AI models can also perform discriminative tasks. In the past, their discriminative abilities were weaker than those of purpose-built discriminative algorithms. However, with the advent of large foundation models, generative models have become increasingly reliable for discriminative tasks, underscoring their importance in the next generation of healthcare-facing AI applications.

In summary, discriminative AI is about identifying patterns in data to make predictions or decisions, while generative AI is about learning the underlying structure of data, which allows it to create new information. Together, these capabilities make AI models ideal for harnessing the complex healthcare data lake and for powering AI applications for surgeons and patients alike.

References

6. Oxford English Dictionary. intelligence, n. In: Oxford University Press; 2023. https://doi.org/10.1093/OED/2404969105

7. Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial Intelligence in Surgery: Promises and Perils. Annals of Surgery. 2018;268(1). doi:10.1097/SLA.0000000000002693

8. Bihorac A, Ozrazgat-Baslanti T, Ebadi A, et al. MySurgeryRisk: Development and Validation of a Machine-Learning Risk Algorithm for Major Complications and Death after Surgery. Ann Surg. 2019;269(4):652–662. doi:10.1097/SLA.0000000000002706

9. Basta MN, Kozak GM, Broach RB, et al. Can We Predict Incisional Hernia?: Development of a Surgery-specific Decision-Support Interface. Annals of Surgery. 2019;270(3). doi:10.1097/SLA.0000000000003472

10. Ras G, Xie N, van Gerven M, Doran D. Explainable Deep Learning: A Field Guide for the Uninitiated. Published online September 13, 2021. doi:10.48550/arXiv.2004.14545

11. McAuliffe PB, Desai AA, Talwar AA, et al. Preoperative Computed Tomography Morphological Features Indicative of Incisional Hernia Formation after Abdominal Surgery. Annals of Surgery. doi:10.1097/SLA.0000000000005583

12. Talwar AA, Desai AA, McAuliffe PB, et al. Optimal Image-Based Biomarkers For Prediction Of Incisional Hernia Formation. Hernia: The Journal of Hernias and Abdominal Wall Surgery. In press.

13. Bommasani R, Hudson DA, Adeli E, et al. On the Opportunities and Risks of Foundation Models. Published online July 12, 2022. doi:10.48550/arXiv.2108.07258

14. Brown TB, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. Published online July 22, 2020. doi:10.48550/arXiv.2005.14165

15. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Published online May 24, 2019. Accessed February 7, 2023. http://arxiv.org/abs/1810.04805

16. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–1240. doi:10.1093/bioinformatics/btz682

17. Alsentzer E, Murphy J, Boag W, et al. Publicly Available Clinical BERT Embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2019:72–78. doi:10.18653/v1/W19-1909

18. Bombieri M, Rospocher M, Dall’Alba D, Fiorini P. Automatic detection of procedural knowledge in robotic-assisted surgical texts. Int J Comput Assist Radiol Surg. 2021;16(8):1287–1295. doi:10.1007/s11548-021-02370-9

19. Banerjee I, Davis MA, Vey BL, et al. Natural Language Processing Model for Identifying Critical Findings-A Multi-Institutional Study. J Digit Imaging. Published online November 7, 2022. doi:10.1007/s10278-022-00712-w

20. Hsu E, Malagaris I, Kuo YF, Sultana R, Roberts K. Deep learning-based NLP data pipeline for EHR-scanned document information extraction. JAMIA Open. 2022;5(2):ooac045. doi:10.1093/jamiaopen/ooac045

21. Kumar A, Goodrum H, Kim A, Stender C, Roberts K, Bernstam EV. Closing the loop: automatically identifying abnormal imaging results in scanned documents. J Am Med Inform Assoc. 2022;29(5):831–840. doi:10.1093/jamia/ocac007

22. Yang X, Chen A, PourNejatian N, et al. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. Published online December 16, 2022. doi:10.48550/arXiv.2203.03540

23. Chau RCW, Chong M, Thu KM, et al. Artificial intelligence-designed single molar dental prostheses: A protocol of prospective experimental study. PLoS One. 2022;17(6):e0268535. doi:10.1371/journal.pone.0268535

24. Ambience | Your documentation on autopilot. Accessed April 12, 2023. https://www.ambiencehealthcare.com/

25. Activ Surgical | Intraoperative Surgical Intelligence | Med Tech. Activ Surgical. Accessed April 12, 2023. https://www.activsurgical.com/

26. Segmed. Accessed April 12, 2023. https://www.segmed.ai/
