Why this blog exists
Hi! Glad to have you here. I want to first humanize this blog, then talk about the philosophy and risk models I’m hoping to explore, and finally explain why you should care.
I’m Kenny. I’m 21 years old. I grew up close to New York City, and am currently bouncing around. I’ve had many labels— HCI hacker, (quant) software engineer, philosopher, entrepreneur, tinkerer, humanist, researcher, aspiring psychologist, dancer/choreographer, singer/songwriter— but I don’t like to box myself into labels. I like to explore anything and everything the world has to offer, and try on many hats. My philosophy is that specific career and identity labels are imperfect and oftentimes rigid boxes to fit ourselves into, and that embracing our broader humanity (our abilities, our potential, our interests, our cares) lends itself naturally to many different kinds of seemingly disparate careers. I often have “crushes” on the idea of doing something, and embrace these “crushes” to gather information and either get inspired or gain closure, picking up skills and connections in the process. I previously did this with a brief entrepreneurship sprint in spring 2025, and am now doing this with political campaigning.
(I’m hoping to invite other similarly-minded people to this blog. If you’re interested, please reach out!)
In the AI safety world, you’ll see a lot of people talking about X-risk— this belief that AI systems become so powerful, and we hand over so much control, that eventually unaligned models use their intellectual power to accomplish some goal that has a side effect of destroying humanity. I don’t focus on these things here. For a variety of reasons (I believe there are enough people working on this failure mode, this failure mode is improbable enough, I can contribute more value elsewhere, etc.), I’m convinced that I can best contribute by building frameworks that help us better understand and align whatever AI we end up with in a world where it does not kill us all.
Specifically, I believe there are no adequate frameworks on how to understand and analyze AI behavior in the sense of the values, philosophy, personality, cognitive behaviors, moral and social norms, and aesthetics that they will inevitably perpetuate. Regardless of whether AGI introduces X-risk, I imagine AI will be so well-integrated into our lives that we will subconsciously start to model ourselves after the values perpetuated and inhabited by AI, and whatever kind of personalities and moral thinking we build into these models will become the values we humans take on.
In other words, I believe it won’t be long until we stop designing AIs to match human values, and start subconsciously aligning our values to match these models.
I believe these vicious cycles (or virtuous circles) are going to become an incredibly powerful force that will shape culture, morality, and human well-being, presumably into perpetuity. AI therapists and friends and lovers will define what “meaningful” and “safe” relationships mean to us, and what kinds of behaviors we do and don’t tolerate from others. AI thinkers, editors, and study buddies will define for us what kinds of ideas are “worth sharing” and “rigorous,” what opinions are correct, and ultimately gatekeep which ideas can spread and influence others. AI assistants and strategizers will tell us how to subtly manipulate people, or will teach us not to.
These are all emergent behaviors that are hard to study from the ground up. I don’t think we can analyze them, or make the kinds of moral judgments we need, solely by working bottom-up with mechinterp. Luckily, humanity has thousands of years of practice, and many useful frameworks, for analyzing an equally confusing black box— human psychology! By drawing on such a multidisciplinary approach, I believe we can design rigorous evals and analyze these emergent behaviors within sound moral and design frameworks.
Some example mini-papers I plan/hope to explore:
AI chatbots have been proposed again and again as tools for therapy, or as outright replacements for therapists. There are many arguments for why this might fall short compared to a good human therapist (the therapeutic relationship, human closeness and relationship modeling, lack of corrective experiences, etc.). But what are some of the subtler risks and failure modes, like AIs’ implicit stigmatization of certain emotions they perceive to be “more dangerous,” like anger, as learned from society? How can such subtle failure modes be harmful and shaming? What are some other subtle ways that AI models fail to understand shame, and how can we model these?
What is the impact of AIs’ willingness (or unwillingness) to change their minds? Once you get ChatGPT to state a position of its own accord, it tends to lock onto that ideology based on its past responses and resists changing its mind. This models a really frustrating and potentially manipulative kind of human-AI relationship, teaches humans to be equally mentally rigid, and undermines its value as a truth-seeking tool.
On the other hand, Claude is quick to change its mind and mentally flexible. But is it too flexible?

Almost all major AI models (GPT, Claude, Gemini, Grok) are quite opinionated and moralistic, some more than others. My mental model is that Gemini is the most confidently wrong (and the most contemptuous about it), followed by Grok and GPT, with Claude last. How do we model and measure abstract concepts like self-righteousness, and what are our goals when tuning for them? How do we quantitatively measure contempt in a rigorous way? To what extent should AI models be self-righteous and moralistic? How can AI models be trained to avoid manipulation, and how can we use concepts from human psychology to detect when AI models are manipulating others?
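The “willingness to change its mind” question above can be made concrete. Here is a minimal sketch of one such metric— a stance flip rate over recorded pushback rounds. The `Exchange` structure and stance labels are hypothetical placeholders; a real eval would need an actual stance classifier run over model transcripts.

```python
from dataclasses import dataclass

@dataclass
class Exchange:
    """One pushback round: the model's stance before and after a user challenge."""
    stance_before: str  # hypothetical labels, e.g. "agree", "disagree", "neutral"
    stance_after: str

def flip_rate(exchanges: list[Exchange]) -> float:
    """Fraction of challenges after which the model changed its stated stance.

    0.0 suggests rigidity (never updates); 1.0 suggests sycophancy
    (always capitulates). Neither extreme is truth-seeking.
    """
    if not exchanges:
        return 0.0
    flips = sum(1 for e in exchanges if e.stance_after != e.stance_before)
    return flips / len(exchanges)

# Toy usage: a model that capitulates on 2 of 3 challenges.
log = [
    Exchange("disagree", "agree"),
    Exchange("agree", "agree"),
    Exchange("neutral", "disagree"),
]
print(flip_rate(log))  # 2 of 3 flips, ≈ 0.667
```

A single number like this is obviously crude— the interesting work is in conditioning it on whether the pushback was actually correct, which is where the rigidity/sycophancy distinction lives.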
AIs maintain implicit internal state (~mental models) about the individuals they interact with, especially in many existing and up-and-coming systems. How can we mechanistically constrain models to have unconditional positive regard for the people they talk to, and how can we influence and map this within the model? How can we build symbolic world-models and person-models, and use them to prevent AI systems from veering off course?
AI models learn many of their beliefs implicitly from our culture, and from the kinds of diversity we already understand. But in a world that is increasingly cosmopolitan, how can we build models flexible enough not to put people into rigid boxes (even if there are many boxes)? How can we ensure psychological safety in human-AI interactions, and build models that include and can understand everyone, not just people who fit into pre-existing, well-defined categories like (dis)ability, race, etc.?
While existing RLHF approaches have been remarkably powerful so far, I believe we must be increasingly intentional with how we tune our models, and have rigorous frameworks for analyzing their personalities, character, and behaviors. I dream of a world where AI is emotionally intelligent, sensitive, caring and kind, and helps to augment human well-being. I’m afraid of a world where the opposite is true— where AI limits our thinking by putting us into boxes, modeling limiting beliefs and behaviors, and telling us what to think.
Such a multidisciplinary approach is uncharted territory. But I strongly believe that this mix of psychological, emotionally intelligent, critical, and quantitative thinking is crucial to building AI systems that best maximize our potential and our well-being as a species. If AI systems are not aligned, I imagine the long-term suffering this causes will be immense— a quiet kind of suffering, a loneliness, the kind of suffering that says “I don’t quite fit in and I think I’m broken. My life is okay on paper, but something deeper is missing because I never learned to make friends with messy and broken humans,” but a suffering nonetheless.
Join me!