Artificial intelligence and epilepsy: Dr. Christian Bosselmann

What’s the role of artificial intelligence in epilepsy research and care? Dr. Alina Ivaniuk talks with Dr. Christian Bosselmann about the potential uses and dangers of AI in epilepsy, including ChatGPT and machine learning.

 

Sharp Waves episodes are meant for informational purposes only, and not as clinical or medical advice.

Podcast Transcript

[00:00:00] Alina Ivaniuk: Hello everyone. It’s Alina from YES-ILAE, bringing you an episode of the Sharp Waves podcast from ILAE. I’m extremely excited about today’s topic. The development of artificial intelligence and machine learning extends back decades, but it is only recently, since the public release of ChatGPT, that it has received such extensive discussion.

And now it’s one of the topics of discussion everywhere, including the medical and scientific community, and of course, it involves the whole epilepsy field. I was very lucky to get just the right guest to speak about this topic. With me today is Dr. Christian Bosselmann. Christian, could you please introduce yourself briefly to our listeners?

[00:00:48] Christian Bosselmann: Thank you so much, Alina. Such a pleasure to be here. My name is Christian Bosselmann. I’m an adult neurologist, epileptologist, and bioinformatician, previously from the University of Tübingen in Germany and currently an epilepsy precision medicine research fellow at the Lal group in Cleveland.

[00:01:09] Alina Ivaniuk: Excellent, Christian. Welcome, and I’m very happy to have you here with me today to speak about artificial intelligence and epilepsy.

Let’s start with setting the record straight regarding the hype around ChatGPT. I think it’s a good entry point for such a complicated topic, and you also have a well-written commentary in Epilepsia, released just recently. Could you please describe for us what ChatGPT is and, overall, how it works?

[00:01:42] Christian Bosselmann: Okay, so ChatGPT, in a few sentences, is a large language model. Large language models, or LLMs, have a very simple and straightforward mechanism: they want to predict the next word in a conversation. In a more formal sense, they predict probability distributions over strings of text. In a more intuitive sense, they try to answer a very simple, everyday problem, which is: what is the likely, sensible, and specific response in a spoken or written conversation? The answer has to be consistent, correct — if possible, and that’s an issue we’ll come to later — and it has to be specific, because of course it would be easy to generate conversations that are very, you know, high level, very non-specific to the topic at hand.
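To make “predicting a probability distribution over the next word” concrete, here is a deliberately tiny sketch, assuming NumPy and made-up scores; real models do the same thing over tens of thousands of tokens, with scores learned from data rather than written by hand.

```python
# Toy next-word prediction: the model assigns a score (logit) to each
# candidate next word, and a softmax turns those scores into probabilities.
# The vocabulary and scores below are invented for illustration only.
import numpy as np

vocabulary = ["seizure", "headache", "banana", "focal"]
logits = np.array([3.1, 1.2, -2.0, 2.4])          # hypothetical model scores

probabilities = np.exp(logits) / np.exp(logits).sum()   # softmax

for word, p in zip(vocabulary, probabilities):
    print(f"P(next word = {word!r}) = {p:.3f}")

# Picking (or sampling) a word from this distribution, then repeating the
# process word after word, is how a conversation gets generated.
print("Most likely continuation:", vocabulary[int(np.argmax(probabilities))])
```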

[00:02:37] Alina Ivaniuk: Is ChatGPT the only one of its kind, or are there analogs to it?

[00:02:42] Christian Bosselmann: ChatGPT is far from the first of its kind. This has been an ongoing research topic for 50, 60 years now. A particular favorite of mine is Eliza, which was programmed in 1964, I believe, by Weizenbaum and colleagues at MIT, one of the very earliest chatbots in the modern sense. This early chatbot was already designed to emulate a patient and doctor relationship, in the sense of what’s called Rogerian psychotherapy, where the chatbot deflected questions or statements back towards the patient without, of course, any true understanding, but it was the first of its kind to mimic a somewhat real conversation.

[00:03:36] Alina Ivaniuk: I find it quite interesting that one of the first models emulated a patient and doctor sort of conversation. And now these models have developed to the point where it’s reasonable to suppose that they could be used somehow in healthcare and, in particular, in helping people with epilepsy.

Do you think that those kinds of models could somehow be applied in the field of epilepsy?

[00:04:05] Christian Bosselmann: That was definitely one of the main motivations behind our recent commentary in Epilepsia. We wanted to quickly provide the readers with some guidance, some additional information that they can have at their fingertips, and that they will hopefully find applicable in their daily life.

Because as you’ve mentioned in the introduction, these models have very much become the subject of considerable hype in the popular media. Given these inflated expectations, I think it’s important to consider the expected use case of these large language models: they are made to generate language. So generating language content, text snippets to improve communication, is definitely well within reach. The examples we give in the commentary are, for example, generating templates for acute seizure action plans or first aid recommendations, or simple explanations of concepts, such as information about expected antiseizure medication side effects.

All of these, of course, have to be subject to revision, so an actual medical professional should review them, but they’re a good general starting point.

Therein also lies the dilemma in this particular application, which is that communication should be the main priority and one of the main skills of a good doctor. That’s what we’re trained to do, and that’s the standard we should hold ourselves to. And that’s not a key competency that we should hand off to models, I believe. There was a publication by Ayers and colleagues in JAMA Internal Medicine, which has caused quite a stir in that regard, where they demonstrated that patients preferred the conversational style of large language models over that of physicians.

They looked at a large social media forum visited by patients and generated responses either by an expert or by a language model. And the results are quite, I don’t want to say disturbing, but they’re quite clearly in favor of large language models, which tended to be more empathetic and appeared more helpful to patients.

And that, I think, should ring some alarm bells, maybe in the sense that we should work on our communication skills, but also in the sense that maybe there’s something to be gained by employing language models as well.

[00:06:42] Alina Ivaniuk: I do agree with you that it should set off some alarm bells. If the AI, or those models, have reached a level where they can generate a response that is more empathetic than a human one, there’s definitely something to think about. Do you see any other pitfalls in potentially employing those models in healthcare? And overall, do you think it’s reasonable?

[00:07:07] Christian Bosselmann: Let me just maybe skip back one step towards the potential uses. Communication, or generating text, is a very applicable but narrow field in daily practice. I think there’s a more general use case for these models, which extends to handling any amount of large, structured text. So I’ve recently become quite interested in research on electronic health records, which, as you’re aware, represent a large amount of paperwork. If you have a patient who has been in the healthcare system for 10, 20, 30 years of follow-up, who’s been seen every few months with a complex disease, for example, then accessing information within that electronic health record becomes critical. You want to be aware of whether you are prescribing an antiseizure medication that they didn’t previously tolerate that well, or you want to be able to quickly find previous imaging findings, for example.

So this sort of data access and structuring is definitely an exciting possibility for natural language processing. And then, and this is where it gets a bit closer to basic research, these large language models don’t necessarily have to generate human language. There’s also a quite recent paper where they didn’t generate English sentences but instead protein sequences, and from those, structures, which has a very wide range of possible use cases.

So that’s, I think, really very exciting.

[00:08:46] Alina Ivaniuk: Let’s switch gears and talk more about the basics, because with this hype around ChatGPT, every second application now claims to be powered by artificial intelligence. Could you please explain to us, from your perspective, what artificial intelligence is, how it differs from machine learning, and how you would personally approach naming those technologies?

[00:09:13] Christian Bosselmann: Okay. Yeah, absolutely. And I think this is really important. These sorts of topics have to be demystified. They have to be directed out of their black box towards sunlight, as it were, to be able to have a good conversation about them and, most importantly, to arm readers with enough understanding to approach these sorts of results in the literature.

Artificial intelligence is a very broad concept, so by definition it’s the ability of a system to perform tasks that require what we consider human intelligence.

So reasoning, discovering meaning, or learning from past experience. There’s of course some leeway in what you would consider intelligence. And when we talk about artificial intelligence in the current context, we’re generally talking about weak AI, weak in the sense that it’s narrow, that it’s trained and focused on specific single or few tasks.

This is very much opposed to the concept of generalized intelligence, or artificial general intelligence, AGI, which is quite the opposite: non-narrow, capable of being applied across a multitude of tasks, potentially all tasks. The latter is, of course, still in the future, let’s say. Right? So that’s artificial intelligence.

Machine learning is a subconcept that’s concerned with the how. That is, how do we get systems towards human intelligence? It’s a way to approach artificial intelligence, in essence. So we want a system that learns without explicit instruction, where you don’t have to code for every single eventuality, but where instead you have an algorithm that makes predictions or decisions based purely on patterns in data.

And I think that’s an important concept to wrap your head around. It’s a lot about pattern analysis and pattern recognition, finding regularities, excluding noise, which also lends itself very nicely to a very visual understanding of the field.
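As a rough illustration of that idea, here is a minimal sketch, assuming scikit-learn and purely synthetic data: no rule is hand-coded anywhere below; the model infers its own thresholds from the patterns in labeled examples.

```python
# A pattern is learned from labeled examples rather than programmed as
# explicit rules. The data are synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)                  # the patterns are learned here
print("held-out accuracy:", round(model.score(X_test, y_test), 2))
```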

[00:11:28] Alina Ivaniuk: Thank you for clarifying the differences and making it clear for our listeners.

Could you dive a bit deeper, but not that deep, into the specifics of how a machine learning algorithm is developed? What are the essential steps towards developing a machine learning algorithm?

[00:11:48] Christian Bosselmann: Absolutely. So, I think from both my commentary and from the interview so far, you may have taken away that I’m somewhat of a skeptic in terms of machine learning, which could be surprising because that’s my main research focus.

But every time this question is posed, I believe step zero should be to ask: do we need machine learning for this problem? Is a simple algorithm enough? There’s a surprising number of problems out there that can be quite easily solved with linear regression or slightly more advanced methods like elastic net. So for 80%, 90% of problems, simple statistical models are more robust, more explainable, and much easier to implement.
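As a sketch of that “step zero”, and assuming scikit-learn with synthetic data, a penalized (elastic-net) logistic regression baseline takes only a few lines; anything more complex then has to justify itself against this number.

```python
# A simple, explainable baseline: elastic-net-penalized logistic regression.
# The dataset is synthetic; in practice this would be your curated data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

baseline = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
)
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("baseline cross-validated AUC:", round(scores.mean(), 3))
# If a more complex model cannot clearly beat this, the simpler,
# more robust, more explainable model is usually the better choice.
```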

Step one is 80% of the work, and that is finding, or generating, and curating data. Finding data that is applicable to the problem and that is complete; that’s where expert judgment comes in. Finding data that is subject to as few biases as possible, and I think we could maybe talk about the concept of bias and these sorts of issues a bit later.

But generally, you want data that is useful for your predictive problem but still represents data as it would be encountered later. And cleaning up this data, finding sources, that’s a lot of the grunt work, as it were. The actual training and testing of different models, of different parameter combinations — that comes back to the issue I mentioned earlier, which is demystifying. In 80%, 90% of use cases, this is fairly simple to do. There are easy-to-use packages out there with great documentation, so you do need some basic programming capabilities, but in essence, it’s absolutely doable. It’s not magic.

Where these projects tend to fail is the last step, which is external validation. That is taking the model outside of your very carefully curated, very nicely set-up data and actually applying it to different patients, at different centers, or on different parts of the problem, for example. This type of validation is the last step on the road towards publishing this sort of model-based research.
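A minimal sketch of what that bookkeeping looks like, assuming scikit-learn and two hypothetical cohorts (a development site and a never-touched “external” site, both simulated here): the point is simply that the external cohort plays no role in training.

```python
# Internal vs. external validation on simulated cohorts. Site B gets extra
# feature noise to mimic a different scanner, protocol, or population.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1200, n_features=15, random_state=1)
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.33, random_state=1)
X_b = X_b + np.random.default_rng(0).normal(scale=1.5, size=X_b.shape)

# Develop entirely within site A ...
X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, random_state=1)
model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

# ... then report both internal and external performance.
internal = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
print(f"internal AUC: {internal:.2f}   external AUC: {external:.2f}")
# A large drop from internal to external performance is exactly what
# external validation is meant to expose before publication.
```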

But there’s a long stretch of the road that comes after publication that’s generally, sadly, quite neglected, which is the continued deployment and life cycle of such a model. So, after you release a model out into the wild, as it were, you publish the data, you put it up in a repository, you list your code on GitHub, make it open source. That’s not where the work is done. After that, you suddenly end up having a small child, as it were: something you’ve got to look after, while being receptive to feedback from the community. One aspect that’s undervalued and unappreciated is just how hard it is to keep these sorts of tools up to date and applicable.

[00:15:07] Alina Ivaniuk: That’s a great summary.

You also give a brief overview of some applications within the epilepsy field, and you gave some examples from your research as well. Are there any other examples of where those models are used, and overall, what kinds of questions related to epilepsy could be answered with machine learning?

[00:15:29] Christian Bosselmann: Happy to. I believe that the interface between machine learning and neuroscience is particularly fruitful.

It’s a great field because we are working with modalities that lend themselves to this sort of analysis. For example, electrophysiology like EEG or other signal-based data, and of course imaging. So we have a method that’s particularly useful for large data problems where we have some human oversight.

And I believe the first one to mention really should be epileptogenic lesion detection. I believe that on this podcast you’ve had people from the MELD project, Dr. Adler and Dr. Wagstyl. I’m a big fan of this sort of application. There are other lesion detection algorithms, like MAP18 by the team of Demarath and Huppertz from Bonn and Freiburg, for example, and lots more.

But I truly believe that integrating multimodal imaging data in the pre-surgical evaluation of epilepsy patients — that’s a fantastic application. Because these are challenges where you have a large amount of data that can be inspected visually, but where it has been clearly demonstrated that there are these very subtle findings, this borderland where you can’t really know what’s non-lesional anymore, how non-lesional is truly non-lesional. That’s where these algorithms shine, so that has to come first.

A second particular favorite of mine is EEG data, from both intracerebral and wearable data sources, for both seizure detection and forecasting. People like Philippe Ryvlin and, of course, Brandon Westover, and many others, are doing fantastic work in that regard.

And I think you’ll see a common theme there. Both lesion detection and seizure detection and forecasting are the closest and most logical starting points for clinical application. These are closest to what matters most, which is our patients. And much in the same sense, last but not least, of course, is phenotyping: gathering large amounts of structured data from electronic health records to quantify disease trajectories, both common and rare, where you really have to leverage the additional complexity and scale provided by having longitudinal electronic health records. That’s where Ingo Helbig and his group have been doing groundbreaking work in epilepsy.

These are, of course, just very, very few cherry-picked examples. It’s a highly active research field, but I believe these are sort of three core issues that are close to being addressed. These are the most sort of advanced, the most refined applications of this method.

In the future? Well, what’s the dream, really? Currently, quite a few of these algorithms live in isolation. They may be challenging to implement, tricky in their behavior. So you need technical expertise to actually get them to run, to provide reasonable results for the data you want to give them.

And having these sorts of different tools, united, robust, and reliable at your fingertips — that is, at the bedside, is where clinical decision support systems, CDSS, come into play, where you have a shared interface in which you can address problems like data from the surgical evaluation, integrating MRI, PET scans, SPECT, and EEG, and doing that quickly and easily enough to be actually usable in the sort of high-throughput, basic clinical work we all know too well. The second part of the question: are there any problems that are not solvable by machine learning? I’d rather flip that question around. Do all problems have to be solvable by machine learning? Not all problems require this sort of method.

There are a lot of applications, a lot of interesting research questions, that are small scale and close to the bedside, in fields where you don’t have large-scale data, you don’t have different modalities, you have rare observations, and there’s an elegance in those classic methods.

So while of course it’s nice to dream about the possibility of algorithms in the future, I’m still a firm believer in that sense of clinical acuity to recognize and act on much the same patterns.

[00:20:32] Alina Ivaniuk: That’s fantastic. It seems that there is huge room for machine learning and artificial intelligence, but at the same time, not all problems require it, and you should be really knowledgeable about the scope of the problem you have and consider the methods you need to solve it, and then you can yield something.

Let’s suppose there is a young researcher who is not a data scientist but got excited about all those opportunities that machine learning offers and wants to learn. How do people usually do that, and how does one gain expertise in that field? What would be your advice to people who are setting out on that road?

[00:21:18] Christian Bosselmann: That’s a really important question, and I believe as with all new methods, it’s really important to have a good environment, and most importantly, a good mentor.

You can’t expect to jump into a wet lab and immediately do experimental work. And in the same sense, you can’t expect to open up your laptop and generate great models. The start, as with any new project, is painful.

Machine learning really should not be the first step in this sort of research direction. There’s some groundwork to be laid: statistical learning theory, probability, linear algebra, all of these mathematical concepts that don’t come naturally to clinicians.

And that’s not our fault, because during our training we aren’t exposed to these methods. I’d wager that most medical doctors are much more comfortable around biological concepts, cell cultures, and animal models than they are around mathematical concepts. And then the second part, which is equally neglected, is good programming practice, the basics of software development, really laying the foundation of brickwork onto which to build robust and long-lasting models. That sounds very bland and very dry, I’m aware. But there’s a danger zone there: without statistical understanding, these models become actively harmful. You do have to have a significant understanding of how these models truly work.

Opening up the black box means taking the time to really consider what makes each type of model different: decision trees, kernel methods, different types of regression analysis. What makes them different, what makes them tick, how they actually work, their foundational theories, and, much more importantly, when they fail.

That’s the key issue. When do these models reach the edge of their capabilities? When do their predictions become unstable, untrustworthy? And if you understand that, then you can design a model that’s as trustworthy as it can be.

[00:23:39] Alina Ivaniuk: So there are lots of papers out there utilizing machine learning. But as you also mentioned, there are lots of pitfalls, and not all researchers consider those pitfalls and control for them in their research. Science can fall into the false assumption that a method works and helps with something when it doesn’t. How can people who are not data science experts interpret machine learning-based research?

[00:24:07] Christian Bosselmann: Quite right. I think that’s a key competency. I mentioned earlier that what previously was competency in basic statistical analysis should now shift towards also including machine learning models, for very much the reason you just outlined: this sort of research output, these papers, are becoming more common. They’re also more common in journals that are traditionally clinician oriented.

So when you are in the role of an interested reader or a reviewer, you suddenly have to tease apart where these manuscripts are robust and where they may have skipped some parts. I think there are two ways to do this. The first thing to keep in mind is: don’t get wowed by shiny new methods.

There are some problems where sophisticated algorithms are required. But for most of the things we try to accomplish in a clinically relevant environment, fairly basic, and by now very well-established algorithms work well enough.

Whenever I get exposed to a new manuscript, either as a reader or as a reviewer, I do have a sort of small checklist I work through, and it’s a really very basic structure to keep in mind. It starts with a very common-sense question: what are we trying to do here? What is the AI designed to learn and predict? Are we trying to predict something that’s useful and applicable, and are we trying to predict it in a way that’s direct? Or are we trying to predict a surrogate marker, for example? So are we trying to predict an outcome directly, or a biomarker for an outcome?

The second step is really having a good look at the data. How has the data been generated? What’s its provenance? Where has it come from? How has it been curated? What does the metadata look like? Is the data available, or can it be made available? That’s key. Open science and open-source practices are of central importance, and code that’s open without data that’s equally open is almost worthless. That’s something to keep in mind. And then the last thing is bias. So in the way the data has been generated or curated, do you see any potential for introducing bias towards any hypothesis?

The next step, again, is fairly dry. The methods section should be written in a way that makes the work reproducible. So you shouldn’t just throw in “the data was used to train this and this model.” It should include what type of algorithm was trained, let’s say a random forest; what hyperparameters, that is, what parts of the model can be changed or tuned; how they were tuned; what sorts of parameter values were tried; and what metrics the models were reported on. Especially in medical research, we often encounter what are called imbalanced problems, where the thing we’re trying to detect is rare. Epilepsy is at roughly 1% prevalence, but some of the rarer epilepsies, genetic syndromes, can be one in a million, let’s say roughly, and detecting them in the population is obviously an imbalanced problem: the true class is exceedingly rare. Then, of course, you need to report metrics that acknowledge this class imbalance.
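As a rough sketch of how that checklist maps onto code, assuming scikit-learn and synthetic data: the algorithm (a random forest), the hyperparameters searched, how they were tuned (cross-validated grid search), and the metric used are all explicit, and therefore all reportable.

```python
# Everything a methods section should name is visible here: the algorithm,
# the hyperparameter grid, the tuning procedure, and the scoring metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
    cv=5,
    scoring="balanced_accuracy",   # a metric that respects class imbalance
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("cross-validated score:", round(search.best_score_, 3))
```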

Accuracy isn’t the main thing that you should be striving for. Accuracy as a metric is fairly easy to trick, because in imbalanced problems, in a one-in-a-million problem, of course you can achieve fantastic accuracy without being able to actually predict anything. So you should include metrics that, for example, take into account precision and recall, like the Matthews correlation coefficient; report rather more metrics than too few, so that you can get a good view of the model itself.
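A tiny sketch of why, assuming scikit-learn and simulated labels for a roughly one-in-a-thousand problem: a “model” that always predicts the negative class gets near-perfect accuracy while detecting nothing, which recall and the Matthews correlation coefficient expose immediately.

```python
# Accuracy vs. imbalance-aware metrics on a rare-positive problem.
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)  # ~0.1% positive class
y_pred = np.zeros_like(y_true)                      # always predict "negative"

print("accuracy:", accuracy_score(y_true, y_pred))                  # ~0.999
print("recall  :", recall_score(y_true, y_pred, zero_division=0))   # 0.0
print("MCC     :", matthews_corrcoef(y_true, y_pred))               # 0.0
```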

The next thing, which I think is what you should spend about 60% or 70% of your time on, is thinking about the potential for overfitting. That is, was this model trained in such a way that it learned noise or properties of the training data, at the expense of performing less well in the general population?

Overfitting is a key concept to understand. That is where models learn noise and when they’re exposed to data that doesn’t have the same noise, the same substructure beneath the signal that we’re trying to learn, then they suddenly crash and burn as it were.
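A minimal sketch of that “crash and burn”, assuming scikit-learn and deliberately noisy synthetic data: an unconstrained decision tree memorizes the training set perfectly and then drops sharply on unseen data, while a depth-limited tree behaves more consistently.

```python
# The tell-tale sign of overfitting: a large gap between training and test
# performance. Labels are deliberately noisy (flip_y) so that there is
# noise for the unconstrained tree to memorize.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

overfit = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
limited = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unconstrained", overfit), ("depth-limited", limited)]:
    print(f"{name:>13}: train {tree.score(X_tr, y_tr):.2f}"
          f"  test {tree.score(X_te, y_te):.2f}")
```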

One red flag that I always try to keep in mind, which is very simplistic, I’ll acknowledge, is to always be critical of models whose performance seems too good to be true. There have been many examples of this in the previous literature. Consider, for example, a paper that reports 99% accuracy in detecting interictal epileptiform discharges on EEG. And from your face, I can already tell that you find that hard to believe, and it is hard to believe, because that’s a really tough problem.

Low signal-to-noise ratio, a lot of noise in general. So this is very much ongoing research. There are three things that may be going on beneath the surface. The first one I already mentioned: imbalanced data sets.

Okay. The second thing is that the model is answering a trivially simple question. If the thing you’re trying to predict is trivially easy to predict from the data the model has been provided, then it’s not a relevant question. There are some things that you can predict almost perfectly, but this isn’t really common in the biomedical domain.

And the third thing is that something went wrong. So there was some data leakage, some inaccuracy in the way the data was split and used for training and testing that led to inflated test statistics. Okay, we’re almost done; I know it’s a longer checklist. The last thing I mentioned earlier, but I just want to drive the point home: reproducibility. Publish the source code, including the pre-processing pipeline and a pre-trained model object, so that anyone who is actually interested in following up on your findings can just open up GitHub, and you’ve made it as easy as possible for them to get the model running themselves.
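As a sketch of both points, assuming scikit-learn and joblib with synthetic data and a hypothetical file name: keeping the pre-processing inside a pipeline means it is fit on the training split only (guarding against the leakage described above), and the fitted pipeline is exactly the kind of pre-trained model object you can publish alongside the code.

```python
# (1) Pre-processing lives inside the pipeline, so the scaler is fit on the
#     training split only and never sees the test data (no leakage).
# (2) The fitted pipeline is saved as a shareable pre-trained model object.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipeline.fit(X_tr, y_tr)
print("held-out accuracy:", round(pipeline.score(X_te, y_te), 2))

joblib.dump(pipeline, "pretrained_model.joblib")  # hypothetical file name
```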

That’s a lot of work. That’s work that’s not usually acknowledged by journals. It’s currently not usually demanded by the community, but it really should be. That’s sort of the cherry on top, as it were.

Okay, so that was a lot to keep in mind. But what I usually remind myself is that we aren’t in the Wild West anymore. It’s not a wild frontier. There are reporting guidelines and checklists for AI reporting that you can work through, which outline many of the same issues I just discussed. And there are registries: much as clinical trials have been formalized in their reporting with flow charts, checklists, and registries, the same has become available for models.

[00:31:54] Alina Ivaniuk: Fantastic. Christian, thank you so much for your insight. I think that we should have a guide for clinicians on how to interpret machine learning based research. Is there anything that we haven’t talked about that you would just like to mention?

[00:32:08] Christian Bosselmann: As these large language models go through this peak in their hype, do keep two or three things in mind.

The first one is the topic of hallucinations. That’s a concept that I briefly want to mention. Language models can confidently produce a sentence that is factually wrong, and that can be very, very difficult for non-experts to tell. There is no guarantee that this problem will resolve itself with increasing scale or training data. This may very well be a core limitation of large language models, and as long as this issue hasn’t been resolved, these models are not fit for immediate clinical use. That’s the key thing I really want to get out there: be very careful, very skeptical about these sorts of outputs. Don’t blindly apply them. They aren’t medical devices. That may sound a bit too negative, but the potential for harm is real.

There’s a fantastic paper, the author eludes me, that demonstrated that all large language models are biased, and that they are biased against people with disabilities. And that’s critical for us, because our patients, individuals affected by epilepsy, are subject to a lot of societal bias and stigma, and that’s really a point where you should feel uncomfortable with these large language models, because this sort of inherent bias is insidious; it can very easily creep into text where you wouldn’t expect it. And again, it’s inherent because these models were mostly trained on conversations from the internet. And we do know what humans on the internet can be like.

And maybe as a very final message: we as researchers in this field, and as interested listeners and readers, do need to keep up with these developments. Make sure to learn what’s usable for clinical purposes and what is not. And we do need to be part of this conversation. These models are currently under the purview of major tech companies almost exclusively, because the computational power and the data needed to train these models are well outside the means of even major academic institutions currently.

And as long as that’s the case, as long as these models aren’t democratized, in a sense, we do need to be stewards and vanguards as it were, making sure that if their use seeps into the clinic, that we are aware of how and when they’re used. That’s really close to my heart and one of the main reasons why we wrote the commentary.

##

Founded in 1909, the International League Against Epilepsy (ILAE) is a global organization with more than 125 national chapters.

Through promoting research, education and training to improve the diagnosis, treatment and prevention of the disease, ILAE is working toward a world where no person’s life is limited by epilepsy.
