Accessibility and Gen AI Podcast

Denny Vrandečić - Head of Special Projects at Wikimedia Foundation

Episode Summary

Hosts Eamon McErlean and Joe Devon interview Denny Vrandečić, Head of Special Projects at Wikimedia Foundation. They discuss how structured data and knowledge graphs can revolutionize digital information. Denny details the development of his current project, Abstract Wikipedia, which aims to create articles using language-independent identifiers. By using symbolic logic rather than the probabilistic methods of Large Language Models, this system allows a single edit to propagate across over 300 languages simultaneously. Ultimately, the goal is to make knowledge more accessible by allowing it to be represented in various reading levels and formats.

Episode Notes

OUTLINE:
00:00 Opening Teaser
00:46 Introduction
01:34 What's The Relationship Between Wikimedia and Wikipedia?
02:02 Denny's Professional Background & Current Role
05:32 What is The Semantic Web?
07:58 Original Proposal For The Web vs The Semantic Web
14:12 When Denny Locked Sir Tim Berners-Lee, The Inventor of The Web, In His Car
15:09 What Is A Knowledge Graph?
19:57 About The New Project, Abstract Wikipedia
30:06 Using Functions vs LLMs for Language Translations
40:19 Using AI For Accessibility
44:14 World Models vs. Large Language Models
51:25 How To Support Wikipedia & Contact Denny
54:13 Wrap Up

EPISODE LINKS:

Wikipedia
https://www.wikipedia.org

Wikimedia Foundation
https://wikimediafoundation.org

Abstract Wikipedia
https://meta.wikimedia.org/wiki/Abstract_Wikipedia

Contact Denny Vrandečić
https://mas.to/@vrandecic

Denny Vrandečić on LinkedIn
https://www.linkedin.com/in/vrandecic/

Denny Vrandečić on Facebook
https://www.facebook.com/vrandezo

Episode Transcription

- As it relates to accessibility, what are your thoughts on AI? Is it a good thing? Is it a bad thing? Are you concerned? Are you optimistic?

- One thing that I hope is that, you know, with LLMs and vibe coding and all these things, we will see a proliferation of ideas and prototypes and hopefully some of those things will stick and actually improve the situation considerably. The tech lead on the Abstract Wikipedia team, James Forrester, he's a champion of the idea that once you have the knowledge in Abstract Wikipedia, once it's there as structured knowledge, we can use the knowledge not only to make it available in different languages, but also in alternative representations that implement constraints that make the result more accessible for people with different capabilities or backgrounds.

- Hello and welcome to "Accessibility and Gen AI", the podcast where we talk to the people shaping the world of accessibility and artificial intelligence. I'm Joe Devon and I'm joined by my co-host, Eamon McErlean, and today we have a special guest, Denny Vrandečić, who is the head of special projects at the Wikimedia Foundation and visiting professor at King's College London. And he is also my friend. Denny, welcome to the pod.

- Thanks for having me.

- Denny, delighted to have you on. Thank you for spending some time with us today. I'm gonna start off with a really, really basic question, and when I hear Wikimedia, I think Wikipedia. Can you share with the audience, many probably already know, but for ignorant people like myself, what's the relationship between Wikimedia and Wikipedia?

- That's exactly correct. Wikipedia is the name of the project that is there to build an encyclopedia in many different languages. Wikimedia is the name of both the organization and the movement behind that project and a few other related projects, actually. The Wikimedia movement has a few other projects like Wikidata, Wiktionary, Wikimedia Commons and others as well.

- Could you talk about just a little bit about your own journey and how you transitioned ultimately throughout the years into this role?

- Sure, I can. Thank you. So I did a master's in computer science and philosophy at the University of Stuttgart, and then went to the KIT, the Karlsruhe Institute of Technology, where I earned my PhD. I was a Wikipedian even before I started with my PhD. I've been at Wikipedia now for more than 22 years and was also at some point elected to the board of the Wikimedia Foundation for one term. During my PhD, Markus Krötzsch, a fellow graduate student in Karlsruhe, and I developed the idea of a semantic Wikipedia, of bringing explicit knowledge structures into Wikipedia, and presented it during the very first Wikimania, which was in 2005. This led to the development of Semantic MediaWiki and once I finished my PhD, I was offered the opportunity to make the semantic Wikipedia idea a reality. This is what then became Wikidata. After Wikidata, I received an offer from Google that was too good to refuse, to work on the world's largest knowledge graph. And so I joined Google and had a few really fun years there. And I returned to the Wikimedia Foundation afterwards to work on my current project, Abstract Wikipedia.

- Do you mind if I ask like that whole semantic idea, where did that originate from? Like why did you push to ensure that Wikipedia had that structure?

- It was mostly just a fortunate coincidence actually, because Markus and actually the research team that I was part of in Karlsruhe was a Semantic Web research group. So our topic was Semantic Web technology, structured data, ontologies and all these things. And then we saw the call for Wikimania and we were like, oh, we just want to join that. We want to be there, we want to meet other Wikipedians and so on. And in order to make it part of our work, we tried to somehow combine our work topic, the Semantic Web, with Wikipedia and to see what could come out. It was just meant to be an idea. We didn't actually mean to implement this, we were just like, let's put this idea out in the world and maybe someone will implement it. And in fact, it was a company, DocCheck, which has become quite a big company by now, that started implementing Semantic MediaWiki. Markus and I didn't know PHP, we had no idea how to actually implement that. And so DocCheck started the first version of Semantic MediaWiki. We saw it actually work and then we took over, we learned PHP, we took over the development, and we were stewarding it for several years. It's still an active open source project that is being maintained and led now by people like Jeroen De Dauw and others. And it's great to see this project continue to flourish, but it was just like we wanted to go to Wikimania. We never thought about the possible implications of that actually becoming reality, to be honest.

- Sorry Joe, I'm completely monopolizing this.

- No, no, no, not at all. Go ahead.

- For the end users, I just wanna get that or for the listeners, I wanna get that foundational and my last question before I be quiet for a few minutes is when you mentioned that term to individuals that maybe are not technology driven, semantic, Semantic Web, how do you describe it to them?

- The Semantic Web is an extension of the existing web that allows for structured data to be published and used from the web. So that people could just pull the data from the web and use it in all kinds of places. That can be, you know, opening times for a business, this can be information about an event, where is it happening, when is it happening? It can be metadata about the accessibility of a website. Also, podcasts like this one are in fact working because you publish Semantic Web metadata about a podcast in one place, probably a website or some other, and then all the different podcast platforms take the structured data, read it and pull in the audio stream and are able to actually connect those things to make it part of a series of podcasts and so on. All of this is happening because this is using a Semantic Web standard in order to describe the podcast. So there are a lot of places where Semantic Web technology is used today in order to provide structured data on the web, and it kind of turns the World Wide Web, in addition to what it is already, you know, this human-readable World Wide Web, also into a giant global graph with a lot of knowledge that is structured and connected with each other.

- A lot of people don't realize the connection like that. Sir Tim Berners-Lee, who invented the World Wide Web, wrote a book, "Weaving the Semantic Web", and how the first version of the web was about making pages that are human readable, from humans to humans, and then the second version, I'll let you explain what it is, but essentially he is the father of both of these. And the entire Semantic Web community, from what I understand, being, you know, when I met you, being in it a little bit, was AI related, and all the AI folks kind of gave up on it and said, let's try this new thing, Semantic Web, and try to market the Semantic Web concept. And the marketing of it didn't work as well, but it seems to be powering a lot of things, like the knowledge graph at Google, schema.org. So I gave you a lot of word soup there, but hopefully you can pull out, speak to the parts that you think might be interesting to an audience new to this.

- Sir Tim Berners-Lee indeed not only invented the web, as is widely known, but also was the first one to start talking about the Semantic Web and brought other people on board to actually start building it. And the interesting thing is that if you look at the original proposal for the web already, I think 1992 or 1993, it might be even 1989. So if you look at the original 1989 proposal for the World Wide Web, it already includes the ideas for the Semantic Web. It describes a web that is not only about documents, but it's also about entities, about things which are connected with properties and so on. And the Semantic Web isn't much more than that. It's a big graph where you have nodes that describe things, that are stand-ins for things, which are connected with properties which have a certain meaning. Both the nodes and the properties are given with URLs, the same thing that we are using for surfing the web. You know, you have something like https://en.wikipedia.org/wiki/Accessibility and you know this is the wiki page, the Wikipedia article about accessibility. And you can use this as a stand-in to talk about the concept of accessibility. So you could, for example, say something like, you find some URI that stands for Joe Devon and you say expert in, use a property for expert in, there might be some CV ontology, whatever, that provides that, and then you point to the accessibility topic from Wikidata or Wikipedia and you can then make this kind of statement in a machine-readable way, which is accessible then to many different languages and which is part of the Semantic Web and can be crawled from there.
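
The statement Denny describes here (a URI for Joe Devon, a property for "expert in", and the accessibility topic) can be pictured as a single subject-property-object triple. The following is only a minimal Python sketch: the example.org URIs for the person and for the "expert in" property are hypothetical placeholders, and only the Wikipedia article URL comes from the conversation.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One machine-readable statement: subject, property, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, invented for this illustration only.
JOE = "https://example.org/people/joe-devon"
EXPERT_IN = "https://example.org/cv-ontology/expertIn"
# The article URL mentioned in the conversation, used as a stand-in for the concept.
ACCESSIBILITY = "https://en.wikipedia.org/wiki/Accessibility"

# A tiny graph is just a set of such statements.
graph = {Triple(JOE, EXPERT_IN, ACCESSIBILITY)}


def objects(graph, subject, predicate):
    """Return everything the subject is connected to via the given property."""
    return sorted(t.obj for t in graph
                  if t.subject == subject and t.predicate == predicate)


print(objects(graph, JOE, EXPERT_IN))
```

Because every part of the statement is an identifier rather than a word in some language, a crawler can read it without parsing any natural language.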

- So just to go a little deeper on that and maybe make it a little more real for folks, I always love to use the example, and we talked about this before we started, I used to think when you're, I used to work at a search engine company by the way, back in the day, that was around before AltaVista, let alone Google. So this is a fascinating topic for me. When you're searching for Michael Jackson, there was a radio talk show host in Los Angeles by the name of Michael Jackson. But if you type in Michael Jackson radio, you're almost certainly going to get Michael Jackson, the singer, and you want to be able to disambiguate, right? And so that is where the URI, which is similar to a URL but is an identifier, comes in, and the sameAs property. So can you explain how that sameAs property would allow you to disambiguate if you're using Semantic Web search as opposed to natural language search?

- Yes, so this was popularized a bit wider when the Google Knowledge Graph was first introduced. I mean, this is a concept that is basic to the Semantic Web and it was there already, but Google had this very good tagline that said it: things, not strings. So it wasn't just, you know, a word like Michael Jackson, but rather they had an identifier, a unique identifier which would say, this is that Michael Jackson that we're talking about, this is that Michael Jackson that we're talking about. Today, one of the most widely used identifiers are actually the Wikidata IDs. So Michael Jackson, the singer, for example, would be Q2831. I don't know that off the top of my head, I just looked it up. So you have Q2831 saying this is Michael Jackson, the singer, whereas Michael Jackson, the radio commentator, would be a much longer Q number, 6831566, and then you know exactly which one you mean. And whenever you have some information that you want to attach, a document that you want to, for example, index and say, well, in this case this document should be indexed with the radio personality, not with the singer, you can use this unique identifier. So inside of Google, the index was switched from a string-based index to an entity-based index so that the quality of the search became much better because you could just disambiguate in search. This was around 2012, when this was introduced at Google, so that you could actually disambiguate very effectively in the search engine if you were looking for one or the other topic.
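
The difference between a string-based index and an entity-based index can be shown with a toy example. This is an illustrative Python sketch, not a description of how Google's index works: the two documents are invented, and only the two Wikidata IDs come from the conversation.

```python
from collections import defaultdict

# Wikidata IDs as quoted in the conversation.
SINGER = "Q2831"
RADIO_HOST = "Q6831566"

# Invented documents, each already annotated with the entity it is about.
documents = {
    "doc1": ("Michael Jackson releases a new album", SINGER),
    "doc2": ("Michael Jackson hosts his Los Angeles radio show", RADIO_HOST),
}

string_index = defaultdict(set)  # keyed by the words on the page
entity_index = defaultdict(set)  # keyed by the thing the page is about

for doc_id, (text, entity) in documents.items():
    if "michael jackson" in text.lower():
        string_index["michael jackson"].add(doc_id)
    entity_index[entity].add(doc_id)

# The string lookup cannot tell the two people apart.
print(sorted(string_index["michael jackson"]))
# The entity lookup returns only documents about the radio host.
print(sorted(entity_index[RADIO_HOST]))
```

The hard part in practice is the annotation step, deciding which entity a document is about; the sketch simply assumes that has been done.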

- It's amazing how far we've come, like I know I sound like an old man, but I remember back in 1998 using like Netscape and it just fascinated me, the wealth of knowledge you could find. And did Google back in those days not have like, do you feel lucky or I'm feeling lucky right below there? It's wild, it really is. And I believe Wikipedia started three years later, was it 2001?

- Yes.

- I fell in love with it right away. Loved it, absolutely loved it. I did.

- Yeah.

- I couldn't believe there was one place you could go to, like a universal library if you will. On a sidebar, I used to host quizzes, pub quizzes all around there and I used it every week. I really did. I absolutely loved it, yep.

- I'm so glad to hear it. Yes indeed, it was 2001 when Wikipedia launched. We are celebrating the 25th year of Wikipedia this year and very happy about where the journey has been taking us so far and we're very excited about where it's going to go in the future.

- And it's amazing that in all this time that Sir Tim Berners-Lee had that concept for the Semantic Web so early on and was patient the way that he set the whole thing up in an attempt to keep, you know, to keep this so that it's for the individual as opposed to, you know, really centered in large corporations, which didn't quite succeed I think the way he wanted it to. But I'm wondering, just as a fanboy, have you met Sir Tim or what do they call him? TimBL?

- TimBL, yes. But when you meet him, you just say Tim, you don't say TimBL, yeah.

- I'm sure.

- I've met him a few times at the World Wide Web conferences and other opportunities and it was always a pleasure to chat with him. I always like to tell a story how I accidentally locked him in my car once when I was driving him to a restaurant. It was in California and I was using my own car and in the back, I usually had the child lock on because my daughter is sitting there. She was, I don't know how old back then, like three or four years old. And so I had him sit in the back and when we arrived there I was like, hmm, where's Tim? Oh, oh right, I need to let him out.

- Oh wow, that's your claim to fame.

- Denny, you mentioned knowledge graph previously.

- Yes.

- Can you explain to our listeners what that really is?

- So a knowledge graph is... It's like a Semantic Web, but for a single organization or a single topic or something like that. So you can think of it like, you know, on the web you have a lot of websites with many web pages and so on, and a knowledge graph is like a website for the Semantic Web. So you would have, well, the biggest open knowledge graph that is currently there is Wikimedia's Wikidata, which is the one that I started in Berlin in 2012. And there are many other knowledge graphs. The most famous one, the first one when it wasn't a generic term yet, was the Google Knowledge Graph, which came out of Metaweb and Freebase, which they had founded and which was then acquired by Google. But today, basically all of the big companies have knowledge graphs in there, like Microsoft has Satori, and many others. And the idea is that a knowledge graph is a large shared database within such an organization, as I said, or within a domain. So you can also have a knowledge graph that is, for example, done by the national libraries, which are then connecting with each other, sharing a lot of data with each other and so on. And because for national libraries, for example, you often have authors in common, right? I mean, you have a lot of books that are written by a person, like, I don't know, Maya Angelou or others, and which will be available in a lot of national libraries. So they want to make sure this is the same person. And then Wikidata actually has identifiers in all those different national libraries saying, okay, this person is identified by the British National Library with this identifier, by the German National Library like this, by the Library of Congress like that and so on. So you have this huge kind of Rosetta Stone connecting all this different knowledge across into one big body of knowledge, which is then basically the Semantic Web, which is the one huge knowledge graph out there.
And knowledge graphs are used inside of organizations to share data more easily. So they build on a shared understanding within the knowledge graph. So you have one list of countries, of cities that you refer to, of job roles in the company, of customers, products, whatever it is that you have, and then you can share data from different parts of the organization more easily by referring to the terms in the knowledge graph or even putting the data directly into the knowledge graph.

- Got it.

- Yeah, so instead of negotiating point to point between the different parts of the organization, you have one place where you all work together on a common vocabulary and then this helps a lot with improving the data quality and also inspiring data reuse.

- Yep, makes sense.

- And actually ServiceNow also has a knowledge graph, I'm sure, because they acquired Juan Sequeda and Dean Allemang's company, I think it's called data.world, right?

- We do indeed, yes, we do indeed. We have been in the knowledge graph path for many years now across the board. Our data fabric, as Bill rightly calls out, is key for us. And having that cohesiveness with the customers that we work with across multiple departments, that data is gold for our customers. We try to ensure that they can tell that holistic journey across their entire business and tie everything together ultimately, yep.

- So this kind of helps, I think a lot of folks understand the underpinnings that power all of this, but it's a bit of a shame that the Semantic Web is not as well understood by larger audiences. So thank you for joining us and helping our accessibility audience and AI audience kind of get a clue into this world. Now you have started Wikidata, which you've mentioned, and now you're doing Wikifunctions, an insane project. When you first described it to me, I thought it was insane. Then as I started to learn more about it, I still thought it was insane and I still think it's insane, but can you start by explaining what problem were you trying to solve and how are you solving it? Yeah, I have no words.

- So it's a two-step project. So first we have Wikifunctions, which is a new Wikimedia project, like Wikidata, like Wikipedia. And it's a project to create a library of functions. Functions, those are things that can compute all kinds of things, you know, including answers to many kinds of questions, like what's the population density of a country if you have the population number and the size, or how many days have passed since Wikipedia was founded? And a lot of functions that generate natural language text. So you can put in some information like the birthdate and the person, and it would create you a sentence, like saying person was born on that date.

- So functions with like variables?

- Yes, exactly, so with arguments.

- Yeah.
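
The three kinds of functions mentioned above (a computed value, a date calculation, and a sentence generator) might look roughly like this in Python. These are illustrative sketches, not functions taken from Wikifunctions; the function names and signatures are assumptions, and the launch date used is Wikipedia's launch on 15 January 2001.

```python
from datetime import date

# Wikipedia launched on 15 January 2001.
WIKIPEDIA_LAUNCH = date(2001, 1, 15)

# English month names are spelled out so the output does not depend on the locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def population_density(population: int, area_km2: float) -> float:
    """People per square kilometre, given a population and an area."""
    return population / area_km2


def days_since(start: date, today: date) -> int:
    """Number of days that have passed between two dates."""
    return (today - start).days


def birth_sentence(person: str, birthdate: date) -> str:
    """Generate an English sentence from a name and a birthdate."""
    month = MONTHS[birthdate.month - 1]
    return f"{person} was born on {birthdate.day} {month} {birthdate.year}."


print(population_density(1000, 50.0))
print(days_since(WIKIPEDIA_LAUNCH, date.today()))
print(birth_sentence("Ada Lovelace", date(1815, 12, 10)))
```

Each function takes arguments and returns a result with no hidden state, which is what allows the same inputs to always produce the same output.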

- So the main goal of Wikifunctions is to support Abstract Wikipedia. This is the actual thing that we're working on. Abstract Wikipedia aims to close knowledge gaps in Wikipedia. So Wikipedia is not only English Wikipedia, it exists actually in more than 300 languages. And currently the content across the languages is very much independent of each other. If you add a fact to French Wikipedia, there's nothing that makes this fact propagate to the Hindi Wikipedia.

- Oh, okay.

- In Abstract Wikipedia, we aim to create shared articles, which you write once, in a language-independent way, and then those articles are brought with high fidelity into many different languages. And then people can just read those articles. They won't necessarily even notice that it's actually coming from Abstract Wikipedia, it will just look like their Wikipedia has become-

- So it's truly localized. You mean the output is just truly localized? Yep.

- Exactly. So the articles are in their language in their Wikipedia, but they're all coming from the same content. They're maintained only once, you have to update it in only one place, and it propagates immediately to all the languages that participate for that particular article. So contributors can work together across language barriers on a shared article, and this really aims to help the smaller language editions in particular, where we currently have a lot of knowledge which is still missing. So a lot of the smaller Wikipedia language editions are missing a lot of basic articles. The whole project is developed with the most important principle of Wikipedia in mind, that knowledge is truly human. The text is not created based on some trained probabilities of an LLM or whatever, but it's entirely under control of the volunteer community. They control the abstract content, they control the functions that translate it into language and so on. And if anything is wrong, someone can click on an edit button and fix it and then it's fixed. It's not like with an LLM, where you tell it, oh, you made it wrong, but who knows if next time it'll remember that you said that. So humans are and remain at the center of Wikipedia. This is not something that just creates articles out of some probabilistic pattern that it learned from the web.

- So like what would that source be then? Would the source be English and then you try to propagate that across all languages? Do the localization or what's that source?

- No, the source is actually just identifiers. I don't know-

- Just identifiers, okay, okay.

- Yeah, it's the same thing in Wikidata. So Wikidata is also a cross-lingual project. You can edit it in any language. You can edit it in German, you can edit it in Hebrew, you can edit it in Arabic, but you are always editing the same knowledge graph because in the end it's just identifiers, whatever it was for Michael Jackson, the Q2388 or whatever, and properties connecting them with each other. But when you go to Wikidata, you see it in your language because each identifier has a label in your language. So we know that Q2831 is Michael Jackson, so we just write Michael Jackson there. If you're looking at it in Arabic, you will see Michael Jackson in Arabic script and so on. So you can edit it in your language and it updates the common shared knowledge graph immediately, it's available in all the languages immediately. And the same thing will be true for Abstract Wikipedia. You will be able to edit it in any of the 300 languages of Wikipedia because it's the same shared content there. It's not English as a base language here.
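
The idea that the stored content is only identifiers, while each reader sees a label in their own language, can be sketched as a simple lookup. This is a Python illustration under assumptions: the label table, the `label` helper, and the English fallback rule are all invented for the example and are not Wikidata's actual data model or API.

```python
# Labels per identifier and language; a tiny stand-in for a label store.
LABELS = {
    "Q2831": {
        "en": "Michael Jackson",
        "ar": "مايكل جاكسون",  # illustrative Arabic-script label
    },
}


def label(entity_id: str, language: str, fallback: str = "en") -> str:
    """Show an identifier in the reader's language.

    Falls back to another language, and finally to the bare identifier,
    when no label exists. The fallback rule is an assumption of this sketch.
    """
    names = LABELS.get(entity_id, {})
    return names.get(language) or names.get(fallback) or entity_id


print(label("Q2831", "en"))
print(label("Q2831", "ar"))
print(label("Q2831", "hr"))  # no Croatian label stored here, so English is shown
```

An edit made while viewing the Arabic labels still changes the shared statement about Q2831, which is why it shows up for every language at once.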

- Yeah, I lead globalization at ServiceNow as well. So from a translate localization standpoint, I do get it.

- So I think some folks might still be a little bit unclear on how this works and it's difficult to explain sometimes because, like, it would be great to differentiate the LLM's way of doing things versus how you're doing it here. And I'm also wondering if AI is a part of it, but perhaps it would be good to explain how the LLM uses probabilities and just uses brute force against the entire internet, every book that's been written, text everywhere, and then uses probabilities, and it doesn't really understand anything, but it just figures out what's the next word, as opposed to how you're doing it. Like when somebody comes in and they wanna add the mayor of Karlsruhe, like what, do they just type natural language or do you force them into a syntax or do you just figure it out in the background? And is that AI? And sorry, I'm gonna add one more thing, how is the Semantic Web dealing with facts as an understanding as opposed to the LLM, which is just probabilities?

- So the interesting thing is that we are basically having a completely symbolic approach to capture information. Like for example, the mayor of Karlsruhe or something like that. So it's not just a probabilistic method saying, if you ask for the mayor of Karlsruhe, then let's auto-complete to the current mayor or whatever. Hopefully, if we learn it quick enough. I mean, if the mayor changes, you always have like the trouble, like how do we actually update this kind of knowledge and so on. They're getting better and better solutions for these kinds of problems. But Abstract Wikipedia and Wikidata are much closer to a traditional database. If something changes, you just change a field in the database, then we use that structured data in the database with functions to actually create text output. And you see that this is a very deterministic step. All of these things that are happening are not being randomly done, but you know exactly why something is happening. Why is this sentence being written? Why is this name being written here? And it's not because you set some temperature a bit higher in an LLM or because it was trained a little bit weird or whatever. I mean, with LLMs, I've done experiments that if I ask it in different languages for factual knowledge, I get different answers, which is completely ridiculous. This is something that wouldn't happen with a structured database, obviously, because you have the facts once. It can still be wrong. I mean, don't get me wrong, it's not that, you know, only LLMs make mistakes. Wikipedia has tons of mistakes, right? Wikidata has probably, I don't know, many, many mistakes. But the thing is you can click on the edit button, edit it, fix it, and then you know it's actually fixed. There's a history of the thing stored and from now on, the new fact is being used. Obviously you can also use it for vandalism and so on, and then hopefully someone will come quickly back and fix it again, but you know exactly why this answer is given.
With an LLM, I mean, if I'm chatting now with ChatGPT or Gemini and I tell it, oh no, this was wrong, and you go tomorrow and ask the question, it's like, I don't know that it'll actually propagate this information there because I mean, it's just, as you say, no it doesn't.

- It does not.

- And we might at some point come to a world where they actually have also knowledge models, not just language models, but here the knowledge graph is basically something that humans can maintain, that humans can update, that they can validate, that they can audit, that they can check, and all of these things. You have explainability built in, which is a huge problem with LLMs. Now having said all that, I'm not bashing LLMs. I totally don't want to do that because the interesting thing is, as you said, you need to learn this kind of a new language in order to write for Abstract Wikipedia. And why would we? I mean, this is exactly what LLMs are really good at. If you just write in natural language, well, we can translate it to the abstract language and we still have you there. So the idea is yes, we use LLMs here, but only in order to support the contributor when writing to Abstract Wikipedia. Then we show the result of that and the contributor has still to say, oh yes. We always have a human in the loop in this case who actually checks that the LLM has understood the contributor correctly and confirms that this is indeed the content. So we are not just making stuff up, but rather we are taking the input of the user, translating it to our abstract representation, and then asking the user again, is this actually what you said or not? And then they can fix it or they can say, yes, this is correct, or whatever. So we always have this human in the loop when creating the knowledge, but for displaying the knowledge, we are not using any probabilistic methods there. We're not using anything that is, we don't throw dice there.

- So you don't like for displaying the knowledge from a language perspective, do you use LLMs to do the localization or translations for if it has to be translated from English to French to Swedish?

- No.

- No?

- No, no, but that's exactly what we don't do. We use functions here that the community writes in order to create natural language text output, because we don't trust the LLMs. We are testing it regularly and it's just, LLMs love to make mistakes here, because it sounds a little better when you make it like this or whatever. Just to give an example that has happened quite recently. We had information about two people who were siblings. So they had shared parents in the structured data, they had a shared place of birth, they each had a date of birth and so on. And the LLM made a really nice readable text where they said, well, person A grew up together with the siblings in that and that city. And it reads nice, but actually the structured data never said that they were growing up together, right? I mean, it's likely, it's probable, but if it didn't happen, what do you do now? How do you go and fix it? Like if I materialize the text and then fix the text, I have lost control over this because it doesn't propagate again to the other languages, right?

- Sorry, why would it not propagate to the other languages just because you fixed the text?

- Because the text is always being generated from the structured knowledge base.

- Okay.

- And if the structured knowledge base didn't include the information that they were growing up together in the first place, but the LLM just hallucinated it and added it there, there's nothing you can remove in order to fix it. I mean, you can go and start fixing the text itself, but then we're just having the same Wikipedia that we have right now, where you can just go and edit Wikipedia, and you lose the connection between the different languages. So if the structured knowledge doesn't reflect what is in the text, we lose this ability to share the article across different languages basically.

- Yeah, I'm still missing that like TMS piece, the translation piece of where it's ambiguous, it's language ambiguous basically it sounds like.

- Language is always ambiguous, you're completely right. So we're not translating from English to French or from German to Arabic, but we force our contributors to actually write in an unambiguous knowledge representation. And this knowledge representation then gets turned into language. So for each language, we would have functions that say, okay, this piece of knowledge representation in this language is being represented by the following sentence. So they would have that information. Like for example, we might have an abstract sentence saying, Wikipedia founded 2001, and then we will create a sentence in German, like, we would have a sentence in Croatian, which would say, I miss my Croatian.

- Yeah and when does that happen? Yeah, no, but Denny, where does that piece happen at? You say, you know what, we'll have a sentence in German, yeah, we'll have a sentence in Croatian.

- Those are the functions that are stored in Wikifunctions.

- The different functions. Okay, so you can get,

- Exactly.

- okay, yup.

- You actually write a function, maybe in Python, maybe in JavaScript or whatever, that actually creates the sentence. And so all those articles would then be completely created by functions from the structured data.

- And then you'll have a separate function, you'll insert another variable, and that variable could be the actual language that you want to have the output in?

- Exactly.

- Got it, yeah, that makes it truly generic, truly, truly, truly generic at the start, and at the end the variable will, like, tee up the output. Yeah.

- Yes.
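
As a rough illustration of the exchange above, here is a minimal Python sketch of how one language-independent statement could be turned into sentences by per-language functions, with the target language passed in as a variable. Everything in it is an assumption made for illustration: the identifier ENTITY_WIKIPEDIA, the label table, and the function names are hypothetical and are not the actual Wikifunctions or Abstract Wikipedia API.

```python
from typing import Callable, Dict

# Hypothetical language-independent content: identifiers and values only, no wording.
ABSTRACT_SENTENCE = {"predicate": "founded", "subject": "ENTITY_WIKIPEDIA", "year": 2001}

# Hypothetical per-language labels for the entity identifier.
LABELS: Dict[str, Dict[str, str]] = {
    "ENTITY_WIKIPEDIA": {"en": "Wikipedia", "de": "Wikipedia", "hr": "Wikipedija"},
}


def label(entity: str, lang: str) -> str:
    """Look up the name of an entity in the requested language."""
    return LABELS[entity][lang]


# One community-written function per language for this kind of statement.
def founded_en(content: dict) -> str:
    return f"{label(content['subject'], 'en')} was founded in {content['year']}."


def founded_de(content: dict) -> str:
    return f"{label(content['subject'], 'de')} wurde {content['year']} gegründet."


def founded_hr(content: dict) -> str:
    return f"{label(content['subject'], 'hr')} je osnovana {content['year']}. godine."


RENDERERS: Dict[str, Callable[[dict], str]] = {
    "en": founded_en,
    "de": founded_de,
    "hr": founded_hr,
}


def render(content: dict, lang: str) -> str:
    """Generate the sentence for the given language from the abstract content."""
    return RENDERERS[lang](content)


if __name__ == "__main__":
    for lang in RENDERERS:
        print(lang, render(ABSTRACT_SENTENCE, lang))
```

Because every sentence is regenerated from the same abstract content, a correction to that content shows up in every language at once, whereas a correction made by hand to one generated sentence would stay in that one language.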

- Let me frame this in another way that might help viewers as well, framing the question a little differently. So in an LLM, first of all, the training is really important. The training date of the LLM dictates most of what's happening in the LLM. You have some post-training things, but essentially, if the date of the training is, let's say, January 1, 2026, and on December 31st, 2026, the mayor of Karlsruhe went from Peter to John, most of the text in the LLM's training is going to have Peter in it, and there might be just a few entries for John because that data came in later. So when you're typing in an LLM, because it goes by probabilities, it will take the most common name in its training data. And what you are doing differently, and the way the Semantic Web works, is that you might have Karlsruhe, which has a URI, which is just like a URL; a URL can be a URI. So here is the page about Karlsruhe and that's a fact. Here is the page about the person that is the mayor of Karlsruhe and that's a fact, and then you have the concept of mayor and that's a fact, and you're tying those three together. And then when you have a new mayor, you're changing that. And so I think the part that I find a little bit unclear with Abstract Wikipedia is, what if there's a name that's not in there already? How does that user say, oh, there's a new mayor? You essentially have to create a new URI for the new mayor and then say that this person is the mayor of this city, and now it's a fact that points to three different URIs. And Eamon, since I've known Denny for 20 years and been familiar with this, I just want to make sure that this question is clear, the framing is clear, or if we should go into more detail here.
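
The idea of tying three URIs together can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example.org URIs, the names Peter and John taken from the example above, and the helper functions are hypothetical placeholders, not real Wikidata identifiers or any real library's API.

```python
from typing import Optional, Set, Tuple

# A fact is a (subject, predicate, object) triple in which every part is a URI.
Triple = Tuple[str, str, str]

EX = "https://example.org/"  # placeholder namespace, assumed for this sketch

KARLSRUHE = EX + "entity/Karlsruhe"
MAYOR = EX + "property/mayor"
PETER = EX + "entity/Peter"
JOHN = EX + "entity/John"  # the new URI created for the new mayor

graph: Set[Triple] = {(KARLSRUHE, MAYOR, PETER)}


def current_value(triples: Set[Triple], subject: str, predicate: str) -> Optional[str]:
    """Return the object linked to subject by predicate, or None if there is none."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def set_value(triples: Set[Triple], subject: str, predicate: str, new_object: str) -> None:
    """Replace whatever the subject had for this predicate with the new object."""
    stale = {t for t in triples if t[0] == subject and t[1] == predicate}
    triples.difference_update(stale)
    triples.add((subject, predicate, new_object))


if __name__ == "__main__":
    print(current_value(graph, KARLSRUHE, MAYOR))  # the URI for Peter
    set_value(graph, KARLSRUHE, MAYOR, JOHN)
    print(current_value(graph, KARLSRUHE, MAYOR))  # the URI for John
```

The change is a single explicit edit to one triple, so there is no need to count how often each name appears in a body of text.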

- I get the principle of it. I think there's still a level of complexity that it's difficult to understand why we would use this as opposed to LLMs.

- Right? So I try to get to an answer to that. So my suggestion and something that I've been raising for a while now is actually not to use it instead of LLMs, but rath-

- In conjunction with?

- In conjunction with LLMs, exactly. LLMs are language models. They're great at dealing with language and so on, but we're not using them only as language models right now, we are also using them as knowledge models. We are training on many, many texts. And now our LLMs, they remember all the capitals in the world and of every state and everything. They know all the characters of "Game of Thrones" and know how old the characters are. They have entire books memorized and they have basically all the Beatles song lyrics memorized. This makes those LLMs huge, and you have to have trillions and trillions of tokens to train them, right? Do we actually need them to be this big? We could take the knowledge out of the LLM and just leave the intelligence in the LLM. The LLM would still be the most important part of the infrastructure because it's actually extremely intelligent. But it should know that, you know, instead of looking up the birth date of Pope, sorry, instead of memorizing the birth date of Pope Leo III, I just look it up. I don't have to have weights in my model that tell me on which date Leo III was born. This is completely ridiculous. We don't have to memorize all these kinds of things. We should push the knowledge out of the language model, for example, into a knowledge graph, and we can push out more knowledge, for example, into documents. We can push out other knowledge into functions, for example, that calculate things. LLMs were at first doing math by actually memorizing the results, which is also ridiculous. But now they are doing math much better. And what they're often doing is they actually write a piece of Python code, run the Python code, and then you get the result, which is exactly the right approach here, right? They create a function on the fly, get the result and use that. This is great. You don't have to memorize all the results of a possible function call and...

- It's basically condensing that mass amount of data, taking the key pieces out of that mass of data, condensing it, both from a cost efficiency perspective and even from a performance perspective over time.

- Yes, I want much smaller language models and I want them to be denser. I want them to focus on the intelligence and put the knowledge into, for example, a knowledge graph, which you can then edit as a human, which you can then maintain, and not even as a human. Actually, an LLM can also edit and maintain the knowledge graph, right? I mean, there's no need for it to be humans who do it, but you should be able to confidently say, okay, I changed the mayor of Karlsruhe now and it is now a new mayor, or I added a new mayor from that date, and now we know this. I shouldn't have to, like, write 500 texts that have this information, train the LLM and hope that it worked. That's not how we should be doing these kinds of things.
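
To make "I added a new mayor from that date" concrete, here is a small Python sketch of a knowledge store with dated statements and a lookup that a language model could call instead of memorizing the answer. It is an illustration only: the class and field names, the start date given for Peter, and the answer template standing in for the language model are all assumptions, not a description of Wikidata or of any existing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    value: str
    start: date  # the date from which the statement holds


class KnowledgeGraph:
    """A tiny in-memory store of dated statements (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._statements: List[Statement] = []

    def add(self, statement: Statement) -> None:
        self._statements.append(statement)

    def lookup(self, subject: str, predicate: str, on: date) -> Optional[str]:
        """Return the value whose start date is the latest one not after `on`."""
        candidates = [
            s for s in self._statements
            if s.subject == subject and s.predicate == predicate and s.start <= on
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda s: s.start).value


def answer_mayor(kg: KnowledgeGraph, city: str, on: date) -> str:
    """Stand-in for the language model: it words the answer, the graph supplies the fact."""
    name = kg.lookup(city, "mayor", on)
    if name is None:
        return f"I have no record of the mayor of {city} on {on.isoformat()}."
    return f"On {on.isoformat()}, the mayor of {city} is {name}."


if __name__ == "__main__":
    kg = KnowledgeGraph()
    kg.add(Statement("Karlsruhe", "mayor", "Peter", date(2020, 1, 1)))  # assumed start date
    kg.add(Statement("Karlsruhe", "mayor", "John", date(2026, 12, 31)))  # one explicit edit
    print(answer_mayor(kg, "Karlsruhe", date(2026, 6, 1)))
    print(answer_mayor(kg, "Karlsruhe", date(2027, 1, 1)))
```

Adding the second statement is the whole update: the next lookup after that date returns John, and no model has to be retrained.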

- When we talk about LLMs and AI, one of the questions we've asked all of our guests, Denny, and it's, you know, it's a pretty open-ended question: as it relates to accessibility, what are your thoughts on AI? Is it a good thing? Is it a bad thing? Are you concerned? Are you optimistic?

- I think that, you know, structured data and screen readers and other assistive technology, they really fit well together, right? I mean, this is something that should be very complementary and help each other. If we can resolve ambiguities, this can help in many different accessibility use cases and so on, and we can help make it easier to understand digital content. So I think there are a lot of great opportunities which have somehow not yet materialized. One thing that I hope is that, you know, with LLMs and vibe coding and all these things, we will see a proliferation of ideas and prototypes, and we will see a lot of new ideas being shown and tested out so that we can see what can happen, and hopefully some of those things will stick and actually improve the situation considerably. Another thing is that the tech lead on the Abstract Wikipedia team, James Forrester, he's a champion of the idea that once you have the knowledge in Abstract Wikipedia, once it's there as structured knowledge, we can use the knowledge not only to make it available in different languages, but also in alternative representations that implement constraints that make the result more accessible for people with different capabilities or backgrounds. So for example, we could have Wikipedia articles with different reading levels, or for people who have certain background knowledge, where we can leave certain things out of the article and make it more dense, and for others we can make it less dense. My favorite example here is, I was at one point reading the article about daisies, the flower, and honestly, I didn't understand anything in the article. It was so full of complicated terms. I have a PhD, not in biology, and this means I don't understand it, but I couldn't even read it. It's, like, a flower that's usually white or whatever. It was all this very, very difficult technical biology terminology that I didn't really understand at all.
And I think that's something where we can, hopefully in the future, then, like, switch it and make it a bit easier to read. And it's not only for... I mean, accessibility goes in many different dimensions, right? It can also mean making it easier to read for kids, for example, or for people with different-

- So universal, yeah, yeah, yeah, yeah. You know, it's the one thing I have to say about Wikipedia, one of the reasons why I love it. It's always an amazing thing to be able to take something really complex that has a lot of data or content behind it and make it look and feel really simple. And the one thing I love about Wikipedia is it's simple. Like, you hit the key information, you hit the key dates, you understand the history, it's very intuitive, it's very digestible on anything you look up. Now, there's a lot of work, I'm sure, that goes on, on what we just discussed, to generate that simplicity, but kudos to yourself and the foundation for doing so, because I just think it's a phenomenal site and it's a phenomenal way and mechanism for us to continue to learn. I love it, I do.

- Thank you, but I have to redirect the kudos to the community. As a foundation, we are really not responsible for the content or for the language level or anything. This is all done by the community of volunteers who are writing the articles, who are deciding on what the language levels will be and all these things. So we keep the service running, we look at trademarks and stuff like this, but the community is really the magic piece that keeps Wikipedia, that made Wikipedia what it is.

- So we have just a little bit of time left, so I'm gonna try and maybe make it more technical to end off. Sorry everybody, but it's super interesting. So Yann LeCun and Fei-Fei Li have been all about world models, and Yann in particular keeps saying that LLMs are going to hit a wall. And a lot of what you've been talking about is removing the facts from LLMs and using structured data, perhaps semantic. I don't know if this is the same concept as the world models that Fei-Fei Li and Yann LeCun are building with, I think it was called JEPA, but how do you see this working? Is there a new way to do pre-training of these models? Because it's absolutely insane that it even works. When you learn how these LLMs are pre-trained, it is brute forcing just massive amounts of data and just moving around these weights by tiny amounts. It seems just insane that it works and it seems crazy to even do this. It's amazing that they spent the money that it takes to do this. So how do you see pulling those facts out, turning this into an understanding and a world model? And will that also improve the amount of money and the brute force quality of LLMs? Like, maybe the GPUs will be less important and maybe it'll be more accessible. By accessible I mean more open source, more open weight, more possible for people to train different kinds of models if we're not brute forcing this massive quantity of data.

- I'm not an expert on world models. My rough understanding is that world models are a model of, well, the environment where an agent is moving in, meaning the actual world. So it's more like an embodied agent, like a robot or similar things, to understand the world or to better connect video input with actuators in the robot and similar things. So it's really about putting it in the world and so on. This is not what I usually mean with a knowledge graph, because this is kind of a different level. I mean, you would say Paris is the capital of France, this belongs in the knowledge graph, but, oh, there's a table in my way, or the way physics works, basically, this would be more encoded in the world model and trained there. And I think that world models are a pretty obvious and very important extension for LLMs because at least humans, we are born with... We are born with something that very quickly makes a world model where we learn naive physics and so on. The work of Peter Heyns comes to mind in this place, to have this kind of naive physics representation in us to be able to not make some of the silly mistakes that LLMs currently make. And I think that world models will be really, really helpful with that. What I'm saying is along the same line, in that I also think we're gonna hit a wall and we are training on far too much data. Those models are far too big because they are encoding so much knowledge. That's what I meant earlier. And this knowledge really should be moved out of the LLMs. It should be in knowledge models or knowledge graphs or whatever, which are much more efficient to display this knowledge, and then have the language model coordinate between these different sources of knowledge that we have. And I think both of them are going the same way.
Amusingly, for example, recently we saw that OpenAI and Perplexity are both going into the shopping domain, and I would've expected, you know, oh well, they have those LLMs, they just go to the website and buy something and that's it, right? But instead, they push for standards that describe the things that you can buy, for standardizing the APIs, the payment APIs for giving money, for buying things, for making orders and so on, and for describing the things that you can buy. And this is the classical Semantic Web approach, you know? You decide on a format, you agree on APIs to share and all these things. It's funny to see a company like OpenAI go and do basically the same thing that has been happening for decades, right? In the classic symbolic approach, which will in use of them, but this is the hybrid thing that we're talking about. It seems that if they want to get shopping in a reasonable timeframe, they basically have to at some point move to the symbolic approach in order to get the confidence you need that the buying of the thing that you want to buy is actually happening, that you're not buying the wrong thing, that you're not paying the wrong amount of money and so on, and you don't want to leave that to a probabilistic system. You want to have a classical system to do that.

- Yeah, you're writing software at the end.

- Yeah.

- But to understand better what you were suggesting, like could you give an example of the type of training data that would be good for the LLM so that it maintains its basic language capability and the kind of document with the facts that should be removed from the training data? Because I don't know how you split that out. Like I'm trying to visualize it and I can't.

- No, no, I don't have an answer for that one. This is something that someone else has to figure out. I don't think that you have to remove the facts from the training data. It's more something like, in the end, you condense the model, as you do, for example, when you take a huge model like Llama did when they did Llama 4, and you had Behemoth, and then you distill it and you condense it to the smaller models that you can run on cheaper hardware and stuff like this. And when you condense the model, you have certain decision points to make, like what do you actually leave out and similar things. And this is the place where we can then decide effectively to say, well, let's not keep the date of birth of Douglas Adams in a language model. Let's not keep whether it was a Tuesday or Wednesday when he was born or whatever. I mean, all of these things would otherwise be encoded in that thing, and we can just, you know, kick that stuff out and focus on the stuff that actually encodes intelligence. But the separation of knowledge and intelligence is not something that's trivial. So my suggestion isn't something that you can just sit down and immediately implement. You still have to get a few good ideas and know how to actually make that and so on. It's not a ready recipe.

- It's a great way to say that, explain that, Denny, the separation between knowledge and intelligence. Yep, it is. Anything you'd like to share with us before we wrap up, Denny? If any of our listeners wanna get in contact with yourself or donate to the foundation, how would they go about it?

- Oh, if you want to donate, I would be very happy if you would donate to the foundation. The foundation is run completely on donations, mostly on small donations. And we are very, very happy that we can still do that in 2026 and that we don't rely on large donors that we get, yeah, exactly,

- Large, yeah.

- that we then depend on. And we really would love to continue keeping this kind of independence from our donors. So if you want to donate, just go to Wikipedia. I hope you know the domain, and there will be a link to donate. We are also having a few fun merch products this year, like mascots and similar things. Other than that, if you want to connect to me directly, I would love to say that the best place to do so is on Mastodon, go to mas.to/@vrandecic. But to be honest, I'm still more around on Facebook and LinkedIn because I'm just an old person, so I'm still around there. But yeah, so there's several ways you could reach out to me. I'm notoriously bad at catching emails, so this is usually a very bad way to try to reach me.

- You're not alone. Trust me, you're not alone.

- I'm still waiting for the LLM to read all my emails and tell you what I missed over the years.

- Well, it's funny, I think, well, like, I don't see why we can't use an AI tool for that, to scan your emails, understand which ones you really need to read.

- Right?

- Create draft responses of the most prioritized ones so you can just approve or deny and you know, good to go. Although I would never do that with Joe, you know, I always read those emails.

- Well thanks, Eamon. You can try Clawdbot, though that was renamed to Moltbot and now is called OpenClaw, but that will help you take care of all of that. It'll answer it for you if you want. The agents have gotten quite good. So I don't know if you've seen that, but that went pretty crazy the last weekend. So yeah, it's the project that reached the highest number of stars in only three months.

- Wow.

- Maybe we'll have him on as well.

- Love it.

- Yeah. Well, thank you so much, Denny. Always a pleasure, always. We always have to work hard to keep up with what you're doing and to try to even understand it. Good luck with the projects, and maybe we'll have you back sometime to give us an update on the progress.

- Well, the good part about the projects I work on, it's like with Wikidata. Once it's there, it becomes much, much easier to understand it, and you want to see how it actually works. Our plan is to launch Abstract Wikipedia this year, in 2026. And I think once it's launched, once it's integrated with Wikipedia, the pieces will fall together like in the puzzle globe of our logo. And people can see how those things actually contribute to Wikipedia and how they make it easier to maintain Wikipedia and increase the quality of Wikipedia articles. But I do admit that before that, I never find exactly the right words to actually explain it so that it's easily understandable, unfortunately. And so I'm looking forward to having it actually launched and having it be available.

- Yeah, I think a graphical representation of how that actually works can go a long way, a long, long way. But again, thank you so much for your time today. We really appreciate it.

- Thank you so much for having me.

- Thank you, Denny.

- It was really fun.