Performance and Metrics Archives | ��ӽ� Business School AI Institute

Smarter Memories, Stronger Agents: How Selective Recall Boosts LLM Performance

HBS AI Content & Learning — Thu, 21 Aug 2025 12:26:01 +0000

One of AI agents’ most powerful tools is memory: the ability to learn from the past, adapt to new situations, and improve over time. But as organizations and professionals increasingly deploy AI agents for complex and long-term tasks, an important question emerges: how can we ensure that these systems learn from experience without getting trapped by their past mistakes? In the new paper “,” , Assistant Professor of Business Administration at ��ӽ� Business School and PI in the Trustworthy AI Lab at the Digital Data Design (D^3) Institute at ��ӽ�, and several co-authors delve into the critical role of memory management in LLM agents. Their paper sheds light on how strategic addition and deletion of experiences can impact the long term performance of AI agents and, critically, how the absence or mismanagement of these measures can actually make agents worse.

Key Insight: Accelerate or Anchor?

“[A] high ‘input similarity’ between the current task query and the one from the retrieved record often yields a high ‘output similarity’ between their corresponding (output) executions.” [1]

The study identifies a foundational behavioral pattern: when an agent’s current task closely resembles a stored memory, the outputs tend to closely match as well. This “experience-following” correlation mirrors how humans often rely on familiar patterns, and it can accelerate learning when the stored example is correct. However, it’s also not without risks. If erroneous or low-quality experiences are stored in memory, they can be applied to future tasks, thereby decreasing the agent’s overall performance. This means that the quality of stored examples is paramount, as bad memories don’t just linger, they can create a propagating error feedback loop.

Key Insight: Selective Addition

“[S]imply storing every experience leads to significantly worse outcomes.” [2]

If the experience-following property shows why quality matters in LLM agents, then addition shows how to control it, and a clear finding from the study is that indiscriminate memory growth actually hurts performance. In tests with three different agents, covering electronic health records (EHRs), the LLM-based autonomous driving agent AgentDriver, and a network security agent, storing every task and output (“add-all”) performed worse than using no memory addition at all. However, using strict evaluation criteria and filtering before storage led to an average 10% performance boost, so memory improvement is less about hoarding information than curating a high-quality knowledge base.

Key Insight: Improvement through Deletion

“History-based deletion consistently removes poor demonstrations with low output similarity, thereby improving long-term performance.” [3]

Even with careful addition, not all stored experiences are equally useful over time. Some look similar to new tasks (“high input similarity”), but consistently produce poor output (“low output similarity”). The authors term this “misaligned experience replay,” and show that pruning these entries improves long-term outcomes. Removing experiences with repeatedly low utility (“history-based deletion”) offered the best boost to performance while effectively and efficiently maintaining memory size. From a strategic perspective, this practice mirrors audits of playbooks, datasets, and best practices to ensure that institutional knowledge remains in top shape.

Why This Matters

The results from this research should give business leaders important context for thinking about how to choose and deploy AI agents: more data isn’t automatically better, and AI’s “experience” can actually be a liability, entrenching errors and bloating infrastructure. Disciplined curation, by selectively adding high-value experiences and strategically deleting low-value or misaligned ones, yields not only better accuracy but also more efficient, adaptable systems. In a world where executives may be involved in decision-making around LLM agents for their organizations, it’s important to have a blueprint for keeping AI agents sharp, reliable, and resilient, just like they plan for the training and advancement of their human employees. By understanding and investing in the processes that keep your AI’s memory in top-shape, your business will be equipped for tomorrow’s challenges.

References

[1] Zidi Xiong et al., “How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior,” arXiv preprint arXiv:2505.16067v1 (May 21, 2025): 2.

[2] Xiong et al., “How Memory Management Impacts LLM Agents,” 5.

[3] Xiong et al., “How Memory Management Impacts LLM Agents,” 9.

Meet the Authors

is a PhD student in computer science at ��ӽ� University, advised by Himabindu Lakkaraju.

is a PhD student in computer science at Michigan State University.

is a PhD student in computer science engineering at University of Minnesota – Twin Cities.

is a PhD student in computer science and engineering at Michigan State University.

is a University Foundation Professor in the computer science and engineering department at Michigan State University.

is an Assistant Professor of Business Administration at ��ӽ� Business School and PI in D^3’s Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at ��ӽ� University, the ��ӽ� Data Science Initiative, Center for Research on Computation and Society, and the Laboratory of Innovation Science at ��ӽ�. Professor Lakkaraju’s research focuses on the algorithmic, practical, and ethical implications of deploying AI models in domains involving high-stakes decisions such as healthcare, business, and policy.

is Assistant Professor of Computer Science at University of Georgia.

The post Smarter Memories, Stronger Agents: How Selective Recall Boosts LLM Performance appeared first on ��ӽ� Business School AI Institute.

Teaching Trust: How Small AI Models Can Make Larger Systems More Reliable

HBS AI Content & Learning — Thu, 03 Jul 2025 16:56:06 +0000

As Gen AI technology continues to rapidly evolve and LLMs are integrated into more and more applications, questions of trustworthiness and ethical alignment become increasingly crucial. In the recent study “,” authors , postdoctoral researcher at ��ӽ� working on trustworthy AI; , undergraduate student at ��ӽ� studying computer science; , PhD student in computer science at ��ӽ�; , postdoctoral research associate at ��ӽ� working on trustworthy AI; and , Assistant Professor of Business Administration at ��ӽ� Business School and PI in D^3’s Trustworthy AI Lab, explore a novel concept: the ability to transfer and enhance trustworthiness properties from smaller, weaker AI models to larger, more powerful ones.

Key Insight: The Three Pillars of AI Trustworthiness

“Trustworthiness encompasses properties such as fairness (avoiding biases against certain groups), privacy (protecting sensitive information), and robustness (maintaining performance under adversarial conditions or distribution shifts).” [1]

The holistic conceptualization taken by the authors in this paper recognizes that, for LLMs to be truly trustworthy, they must excel across multiple domains simultaneously. The researchers tested and demonstrated these principles using real-world datasets, including the Adult dataset, based on 1994 U.S. Census data, where they evaluated fairness by examining whether AI predictions of income varied based on gender attributes. Their privacy assessments used the Enron email dataset, containing over 600,000 emails with sensitive personal information including credit card numbers and Social Security Numbers. For robustness, they used the OOD Style Transfer, which incorporates text transformations, and AdvGLUE++ datasets, which includes examples for widely used Natural Language Processing (NLP) tasks.

Key Insight: Utilizing Novel Fine-Tuning Strategies

“This is the first work to investigate if trustworthiness properties can transfer from a weak to a strong model using weak-to-strong supervision, a process we term weak-to-strong trustworthiness generalization.” [2]

The ��ӽ� team developed two distinct strategies for embedding trustworthiness into AI systems. Their first approach, termed “Weak Trustworthiness Fine-tuning” (Weak TFT), focuses on training smaller models with explicit trustworthiness constraints, then using these models to teach larger systems. The second strategy, “Weak and Weak-to-Strong Trustworthiness Fine-tuning” (Weak+WTS TFT), applies trustworthiness constraints to both the small teacher model and the large student model during training.

Their experiments demonstrate that the Weak+WTS TFT approach produces significantly superior results, with improvements in fairness of up to 3 percentage points (equivalent to a 60% decrease in unfairness), as well as in robustness, or how resilient the AI was to attacks and unexpected situations. Remarkably, these ethical improvements required only minimal sacrifices in task performance—decreases in accuracy did not exceed 1.5% across tested properties.

Key Insight: Challenges in Privacy Transfer

“Privacy presents a unique situation. Note that the strong ceiling (1) does not achieve better privacy than the weak model.” [3]

A key finding of the study is that not all trustworthiness properties transfer equally from weak to strong models. While the transfer of fairness and robustness properties showed promising results, privacy proved to be a more challenging attribute to transfer. The researchers found that larger models have a greater capacity to retain and recall details from their training data, which creates heightened vulnerabilities for exposing sensitive or confidential information. This finding highlights the complex nature of privacy in AI systems and suggests that different strategies may be needed to address privacy concerns in larger models.

Why This Matters:

For C-suite executives and business leaders, this research offers a potential pathway to developing more powerful LLM systems without compromising on certain ethical considerations. It suggests that companies could potentially start with smaller, more manageable models that are fine-tuned for trustworthiness in fairness and robustness, and then scale up to more capable systems while maintaining or even improving these critical properties. This approach could help mitigate risks associated with LLM deployment, enhance public trust in AI-driven decisions, and potentially reduce the resources required for ethical LLM development. However, the challenges identified in transferring privacy properties serve as a reminder of the complex nature of AI ethics. Business leaders should remain vigilant and consider multi-faceted approaches to ensuring the trustworthiness of their LLM systems, particularly when dealing with sensitive data.

Footnote

(1) The strong ceiling represents the benchmark performance of a large model that has been directly trained with trustworthiness constraints, serving as the upper bound for what the weak-to-strong approach should ideally achieve.

References

[1] Martin Pawelczyk et al., “Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models,” arXiv preprint arXiv:2501.00418v1 (December 31, 2024): 1.

[2] Pawelczyk et al., “Generalizing Trust,” 2.

[3] Pawelczyk et al., “Generalizing Trust,” 8.

Meet the Authors

is a postdoctoral researcher at ��ӽ� working on trustworthy AI.

is an undergraduate student at ��ӽ� studying computer science.

is a PhD student in computer science at ��ӽ�.

is a postdoctoral research associate at ��ӽ� working on trustworthy AI.

The post Teaching Trust: How Small AI Models Can Make Larger Systems More Reliable appeared first on ��ӽ� Business School AI Institute.

The Future of Decision-Making: How Generative AI Transforms Innovation Evaluation

HBS AI Content & Learning — Wed, 15 Jan 2025 14:37:51 +0000

As businesses grapple with an ever-growing volume of ideas, products, and solutions to evaluate, decision-making processes are being reshaped by artificial intelligence (AI). Generative AI, in particular, has emerged as a game-changer in creative problem-solving and evaluation, as demonstrated by a recent field experiment described in the working paper “.”&�Բ��;

The paper—by , Assistant Professor at ��ӽ� Business School and a co-Principal Investigator of the (LISH) at ��ӽ�’s Digital Data Design Institute (D^3) and a team of researchers (see Meet the Authors section below for details)—describes how AI can augment decision-making for early-stage innovation screening.

The experiment, conducted with MIT Solve, included 72 experts and 156 non-expert community screeners who evaluated 48 solutions submitted to the 2024 Global Health Equity Challenge. The team used the GPT-4 large language model (LLM) to recommend whether to pass or fail each idea and provide criteria for failure. The evaluation phase was designed with three conditions:

A human-only control condition, with no AI assistance
Treatment 1: black box AI (BBAI), AI recommendations without rationale
Treatment 2: Narrative AI (NAI), AI recommendations with rationale

Key Insight: AI-Augmented Decisions Are More Stringent

“Screeners were 9 percentage points more likely to fail a solution under the treatment conditions than the control condition.” [1]

Generative AI can be a source of rigor in evaluation. According to the authors, evaluators using AI recommendations were more discerning in their decision-making compared to human-only groups. The study highlights that AI-assisted screeners tended to fail solutions more often than their human-only counterparts, particularly when using treatment 2, which provided detailed narratives justifying its recommendations.

The NAI approach stood out as particularly effective, especially for subjective criteria like quality or alignment with goals. The researchers observed that human screeners were significantly more likely to follow narrative AI’s recommendations because the rationale added credibility and context to its suggestions.

Key Insight: Balancing Objectivity and Subjectivity in AI Collaboration

“[E]ffective decision-making for subjective criteria requires human oversight and close collaboration with AI.” [2]

While AI excels at tasks requiring objective analysis, its role in subjective evaluations remains nuanced. The study revealed a marked difference in human alignment with AI recommendations based on whether the criteria were objective or subjective. For objective tasks, such as assessing technical feasibility, AI provided valuable consistency. However, for subjective tasks, such as evaluating novelty or aesthetics, human oversight was indispensable. The researchers noted that over-reliance on AI narratives for subjective decisions could sometimes lead to uncritical acceptance of its conclusions.

Key Insight: The Rise of AI Interaction Expertise

“[Our findings suggest] the emergence of a new form of expertise—AI interaction expertise—which involves effectively interpreting, questioning, and integrating AI-generated insights into decision-making processes.” [3]

The authors suggested that integrating AI into decision-making demands more than technical know-how; it requires “AI interaction expertise.” The paper emphasized that screeners who deeply engaged with AI recommendations—examining and, when necessary, challenging them—were better able to integrate AI insights into their decisions. This highlights a new skill set for the modern workforce: the ability to collaborate effectively with AI systems.

Why This Matters

The authors’ experiment and conclusions can help C-suite and business executives assess the value of using LLMs in decision-making, specifically by:

Recognizing AI’s strengths and weaknesses related to objective and subjective decision-making criteria. LLMs can potentially be used to pre-screen decisions based on objective criteria, and send those results to human screeners. Decisions involving subjective criteria require close human-AI collaboration, where AI tools act as “sounding boards” that complement the decision-making process.
Understanding the importance of AI interaction expertise in the workforce to interpret AI results and implementing AI training that highlights the value of human perspectives and the uses and risks of AI tools.

As is often the case in studies of the current state of generative AI tools, the authors concluded that “The key lies in leveraging LLMs as tools to augment human decision-making rather than replace it entirely.” [4]

References

[1] Jacqueline N. Lane, Léonard Boussioux, Charles Ayoubi, Ying Hao Chen, Camila Lin, Rebecca Spens, Pooja Wagh, and Pei-Hsin Wang, “The Narrative AI Advantage? A Field Experiment on Generative AI-Augmented Evaluations of Early-Stage Innovations”, ��ӽ� Business School Working Paper 25-001 (2024): 1-60, 5.

[2] Lane, et al., “The Narrative AI Advantage? A Field Experiment on Generative AI-Augmented Evaluations of Early-Stage Innovations”, 33.

[3] Lane, et al., “The Narrative AI Advantage? A Field Experiment on Generative AI-Augmented Evaluations of Early-Stage Innovations”, 31.

[4] Lane, et al., “The Narrative AI Advantage? A Field Experiment on Generative AI-Augmented Evaluations of Early-Stage Innovations”, 36.

Meet the Authors

is an Assistant Professor at ��ӽ� Business School and a co-Principal Investigator of the (LISH) at ��ӽ�’s Digital Data Design Institute (D^3). She earned her Ph.D. from Northwestern University.

, is an Assistant Professor in the Department of Information Systems and Operations Management at the University of Washington, Foster School of Business, with an adjunct position at the Allen School of Computer Science and Engineering. He earned his Ph.D. in at the .

is a Postdoctoral Research Fellow at the Laboratory for Innovation Science at ��ӽ� (LISH) supported by a research grant from the Swiss National Science Foundation (SNSF). His research examines the processes of knowledge creation and diffusion in the context of science and innovation. He studies how scientists use their resources and informational advantages to achieve scientific breakthroughs, greater dissemination of knowledge and accessibility of innovation.

is a Lecturer at the University of Washington Global Innovation Exchange.

is an AIOps Product Manager at Microsoft. Prior to her work at Microsoft, Lin earned her Master’s in Information Systems from the University of Washington where she worked as a Research Assistant.

is Results Measurement Manager and focuses on using research methods to understand Solve’s effectiveness and impact. Before joining Solve, Rebecca worked on evaluation and research in UK government, most recently at the Ministry of Justice. Rebecca holds a Master’s in Development Practice from Emory University and a BA in Modern History and French from the University of St. Andrews.

is Director, Operations & Impact at . Pooja came to Solve in 2017 with over a decade of experience in international development, program evaluation, and data analysis in the private and nonprofit sectors. Pooja holds a Masters in Public Policy from the ��ӽ� Kennedy School and a Bachelors in electrical engineering from MIT.

is a Cloud First Product Manager at Accenture. At the time of the research article’s publication, Wang was a Research Assistant and Data Scientist at the University of Washington.

The post The Future of Decision-Making: How Generative AI Transforms Innovation Evaluation appeared first on ��ӽ� Business School AI Institute.

Bridging the Gap Between Understanding and Control: Insights into AI Interpretability

HBS AI Content & Learning — Fri, 10 Jan 2025 15:38:21 +0000

As large language model (LLM) systems grow in complexity, the challenge of ensuring their outputs align with human intentions has become critical. Interpretability—the ability to explain how models reach their decisions—and control—the ability to steer them toward desired outcomes—are two sides of the same coin.

“”—research by , Graduate Fellow PhD student at ��ӽ� University Kempner Institute and the Digital Data Design Institute (D^3) Trustworthy AI Lab; , Research Scientist at Bosch AI; , Assistant Professor of Business Administration at ��ӽ� Business School and PI in D^3’s Trustworthy AI Lab; and , Senior Research Scientist at Google DeepMind—found that many methods developed to address these issues focus on one aspect, neglecting the other. The study introduces a new approach that unifies interpretability and control and proposes intervention as the primary goal, and evaluates how well different methods enable control through intervention.

Key Insight: Intervention as a Fundamental Goal of Interpretability

“[W]e view intervention as a fundamental goal of interpretability, and propose to measure the correctness of interpretability methods by their ability to successfully edit model behaviour.” [1]

The authors define intervention as the deliberate modification of specific human-interpretable features within a model’s latent representations¹ to achieve desired changes in its outputs, or its responses to prompts. They argue that the ability to intervene in a model’s behavior this way should be a core objective of interpretability methods. By focusing on intervention, they provide a practical way to assess the effectiveness of various interpretability techniques. This approach shifts the focus from understanding a model’s inner workings to actively influencing its outputs, bridging the gap between theory and application.

Key Insight: A Unified Framework for Interpretability and Control

“[W]e present an encoder-decoder framework that unifies four popular mechanistic interpretability methods: sparse autoencoders, logit lens, tuned lens, and probing.” [2]

The study uncovered a critical limitation in current interpretability methods: their performance varies significantly across different models and features. To address these performance issues, Bhalla et al. present a new approach to unifying diverse interpretability methods under a single framework—the encoder-decoder model. Their framework maps intermediate latent representations to feature spaces that are understandable by humans, allowing interventions to these features. These changes can then be translated back into latent representations to influence the model’s outputs.The study evaluates four methods within its unified framework to determine their relative strengths and weaknesses for both interpretability and control:

Logit Lens: Easy to use, requires no training, maps features directly to individual tokens in the model’s vocabulary, and generally has high causal fidelity², but is limited by predefined, static features
Tuned Lens: Extends Logit Lens with additional learned linear transformation³, which improves its flexibility and effectiveness, but requires additional training and tuning
Sparse autoencoders (SAEs): Can learn a large dictionary of low-level and high-level or abstract features, but are difficult to train and label and shows lower causal fidelity
Probing: Trains simple classifiers (often linear) on top of model representations to predict specific features or concepts, but is prone to spurious correlations, leading to low causal fidelity

Key Insight: Measuring Success Through Interventions

“[W]e propose two evaluation metrics for encoder-decoder interpretability methods, namely (1) intervention success rate; and (2) the coherence-intervention tradeoff to evaluate the ability of interpretability methods to control model behavior.” [3]

The authors introduce two metrics to determine if interventions are accurate and maintain the integrity and functionality of AI systems in real-world applications:

Intervention success rate: Measures the effectiveness, or whether the intervention achieves its goal
Coherence-intervention tradeoff: Measures practical utility, ensuring the intervention does not make the model’s outputs unusable by affecting its coherence and quality

Among the methods evaluated, the two lens-based approaches had the highest intervention success rates. However, due to current shortcomings, such as inconsistency across models and features, and the potential compromising of performance and coherence, the authors found that, when it comes to directing model behavior, simpler options, such as prompting, prevail over intervention methods.

Why This Matters

For business professionals and C-suite executives, the insights presented by Bhalla and her team represent a pivotal development in the practical application of AI technologies. As organizations increasingly rely on AI for tasks ranging from low-level to critical, understanding how to align these systems with human and organizational values is paramount. The proposed framework and metrics provide actionable tools to ensure AI systems are both correct and usable. The study also underscores the need to select and evaluate interpretability methods carefully based on the specific models used and tasks involved.

Footnotes

(1) Latent representation refers to the internal, abstract representation of data within a machine learning model. These representations are not directly interpretable by humans but encode meaningful patterns or features of the input data.

(2) Causal fidelity is the extent to which intervening on a specific feature of an explanation results in the corresponding change in the model’s output.

(3) A linear transformation is a mathematical function that converts one vector into another while maintaining the properties of vector addition and scalar multiplication. Put simply, it changes the direction and size of vectors without warping or distorting the structure of the space they occupy.

References

[1] Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, and Himabindu Lakkaraju, “Towards Unifying Interpretability and Control: Evaluation via Intervention”, arXiv preprint arXiv:2411.04430v1 (November 7, 2024): 2.

[2] Bhalla, et al. “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.

[3] Bhalla, et al. “Towards Unifying Interpretability and Control: Evaluation via Intervention”, 3.

Meet the Authors

, is a PhD student in the ��ӽ� Computer Science program at ��ӽ� University Kempner Institute, and a fellow at the Digital Data Design Institute (D^3) Trustworthy AI Lab. Advised by Hima Lakkaraju, her research focuses on machine learning interpretability. Bhalla is also a dedicated advocate for diversity in computer science, mentoring early-career minority students to support their growth in the field.

is a Research Scientist at Bosch AI with a focus on model interpretability, data-centric machine learning, and the “science” of deep learning. They completed their Ph.D. with François Fleuret at Idiap Research Institute & EPFL, Switzerland, and were a postdoctoral research fellow with Hima Lakkaraju at ��ӽ� University. They have organized workshops and seminars on interpretable AI, including sessions at NeurIPS 2023 and 2024, and contributed to teaching an explainable AI course at ��ӽ�. Their work bridges theoretical advancements and practical applications of explainable AI.

is an Assistant Professor of Business Administration at ��ӽ� Business School and PI in D^3’s Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at ��ӽ� University, the ��ӽ� Data Science Initiative, Center for Research on Computation and Society, and the Laboratory of Innovation Science at ��ӽ�. She teaches the first year course on Technology and Operations Management, and has previously offered multiple courses and guest lectures on a diverse set of topics pertaining to Artificial Intelligence (AI) and Machine Learning (ML), and their real world implications.

is a Senior Research Scientist at Google DeepMind, where she focuses on aligning AI with human values by understanding, controlling, and demystifying language models. She earned her Ph.D. from the MIT Media Lab’s Affective Computing Group and has conducted research at Google Research, Microsoft Research, and EPFL. Previously, she worked in digital mental health, collaborating with ��ӽ� medical professionals and publishing in leading journals.

The post Bridging the Gap Between Understanding and Control: Insights into AI Interpretability appeared first on ��ӽ� Business School AI Institute.

The Promise and Pitfalls of AI in Strategic Decision-Making

HBS AI Content & Learning — Tue, 07 Jan 2025 19:59:12 +0000

As artificial intelligence (AI) continues to advance rapidly, its potential to transform strategic decision-making processes in business is becoming increasingly apparent, but how can strategists be sure their AI tools are getting it right? A recent study, “”, by researchers , Assistant Professor of Strategy and Entrepreneurship at the UCL School of Management, , Associate Professor of Marketing at University of Oxford’s Saïd Business School, , a research fellow at the UCL School of Management, and , Associate Professor in Strategy and Entrepreneurship at the UCL School of Management, explores how generative AI, particularly large language models (LLMs), can be leveraged to evaluate strategic decisions like selecting business models. Their findings reveal both the current limitations and future promise of AI as a tool for strategic foresight.

The paper investigates generative AI’s use in strategic decision-making through two studies: Study 1 evaluates 60 AI-generated business models from various industries, while Study 2 assesses 60 competition-submitted models. Business models were paired within industries and assessed by AI, human experts, and non-experts. AI evaluations were aggregated across multiple LLMs (Anthropic, Google, Meta, Mistral, OpenAI), roles (e.g., founder, investor, industry expert), and prompts, to measure the effects of diversity and scale. The approach emphasized systematic comparison through consistent pairwise evaluation methods, comparing two options and selecting which business model was more likely to succeed.

Key Insight: AI Bias and Inconsistency

“We find that individual generative AI evaluations are inconsistent and biased.” [1]

The researchers found that when asked to evaluate business models individually, AI systems often produced inconsistent results. The order in which business models were presented could affect the AI’s choice, and there were systematic biases toward selecting either the first or second option.

In Study 1, for example, the highest consistency—that is, when the evaluation of business models A and B yielded the same prediction as the evaluation of B and A—among LLMs was 80.9%, achieved by GPT-4 Turbo using the chain-of-thought prompt. Other models showed significantly lower consistency, such as Claude2 with the base prompt, which reached just 42.2%. Similarly, in Study 2, consistency varied widely, ranging from 29.9% for GPT-3.5 with the chain-of-thought prompt to 78.1% for Llama 3 with the base prompt.

Key Insight: Aggregating AI Evaluations Improves Accuracy

“[A]ggregating these [individual] evaluations results in increased agreement with human experts.” [2]

While individual AI evaluations were problematic, the researchers discovered that aggregating multiple AI evaluations produced results that aligned more closely with human expert judgments. In both studies, the comprehensive AI evaluator achieved a Pearson correlation¹ of about 0.67 with human expert rankings, indicating a strong positive linear relationship. The Spearman correlation² in Study 1 was lower, at 0.463, which was similar to non-human experts, but was higher in Study 2, at 0.72.

The study also used “top choice” and “bottom choice” as measures to pick overall winners and losers. This metric is particularly relevant if the primary goal of the evaluation process is to select the most promising option, as in venture capital funding or incubator programs where identifying and supporting winners is key. In Studies 1 and 2, the aggregated AI evaluation matched human experts in 5 and 4 out of 10 industries, respectively, while human non-experts matched in only 2 of 10 in Study 1.

Key Insight: Diversity and Scale Both Contribute to Improved AI Performance

“The wisdom of the crowd, or the benefit of aggregating predictions, depends on two mechanisms: the crowd’s diversity and scale.” [3]

The study examines how diversity and scale influence the effectiveness of AI evaluations. Diversity, achieved by aggregating outputs from multiple LLMs, roles, and prompts, modestly improved alignment with human experts. Scaling, which involved increasing the number of evaluations aggregated, had a more substantial impact on agreement. The comprehensive AI evaluator, combining diversity and scaling, outperformed others. The findings emphasize that while diversity offsets errors through varied perspectives, scaling consistently enhances the predictive accuracy of aggregated AI evaluations.

Why This Matters

For business leaders, this research highlights both AI’s potential and its potential pitfalls in strategic decision-making. To leverage AI effectively, businesses should implement diverse and large-scale approaches rather than relying on a single model. This aggregated output can provide valuable data-driven insights that can be considered alongside human judgment and expertise. As AI continues to evolve, it will be increasingly able to augment human decision-making, offering a competitive edge by improving the quality and efficiency of critical business strategies.

Footnotes

(1) The Pearson correlation measures the linear association between two continuous variables. In this study, it reflects the strength and direction of the relationship between the win proportions assigned to business models by AI and by human experts.

(2) The Spearman correlation measures the monotonic association between two variables. In this study, it examines the similarity in the rankings of business models based on their win proportions as assigned by AI and by human experts. This correlation considers only the rank order of the business models and not the actual magnitude of the differences in win proportions.

References

[1] Anil Doshi, J Jason Bell, Emil Mirzayev, and Bart Vanneste, “Generative Artificial Intelligence and Evaluating Strategic Decisions”, Strategic Management Journal (Forthcoming 2025, Available at SSRN: ): 1-37, 27.

[2] Doshi et al., “Generative Artificial Intelligence and Evaluating Strategic Decisions”, 27.

[3] Doshi et al., “Generative Artificial Intelligence and Evaluating Strategic Decisions”, 10.

Meet the Authors

is an Assistant Professor of Strategy and Entrepreneurship at the UCL School of Management. Anil earned his doctorate from the Technology and Operations Management unit at ��ӽ� Business School. He received an A.B. in Economics and Government from Dartmouth College.

is an Associate Professor of Marketing at University of Oxford’s Saïd Business School. He works under the Future of Marketing Initiative and studies AI, perception, new products and choice processes. He uses Bayesian methods to model consumer demand and decision making and his work has been published in peer-reviewed journals such as Marketing Science and Journal of Marketing.

is a research fellow at UCL School of Management. He has a PhD in Management from SKEMA Business School and a PhD in Economics from Université Cote D’Azur.

is an Associate Professor in Strategy and Entrepreneurship at the UCL School of Management. Bart’s research focuses on artificial intelligence and corporate strategy. He is the Program Director of the AI for Business executive education program at the UCL School of Management.

The post The Promise and Pitfalls of AI in Strategic Decision-Making appeared first on ��ӽ� Business School AI Institute.

Promoting Fair Representation in AI Image Retrieval

HBS AI Content & Learning — Thu, 02 Jan 2025 19:31:21 +0000

As artificial intelligence systems become more prevalent in our daily lives, ensuring these technologies are fair and representative of diverse populations is increasingly critical. A recent study, “”, conducted by , an Associate Professor of Electrical Engineering at , and his research group (see the bottom of the page for author details) at ��ӽ� University, introduces an innovative approach to measuring and promoting diversity in AI image retrieval systems. Their work addresses a key challenge in the field: how to ensure retrieved images reflect the true diversity of society across multiple intersecting demographic groups.

Key Insight: Current Approaches Fall Short on Intersectional Representation

“Ensuring representation across individual groups (e.g., given by gender or race) does not guarantee representation across intersectional groups (e.g., given by gender and race).” [1]

The researchers found that existing methods for promoting diversity in image retrieval often focus on balancing representation across a small number of pre-defined groups, typically based on single attributes like gender or race. However, they argue that this approach fails to account for intersectional groups, those defined by multiple overlapping attributes. For example, a system may retrieve an equal number of men and women but still under-represent women of color. The study demonstrates that optimizing for individual group representation does not necessarily lead to fair representation of intersectional groups.

Key Insight: A New Metric for Multi-Group Representation

“We propose a metric called Multi-group Proportional Representation (MPR) to quantify the representation of intersectional groups in retrieval tasks. MPR measures the worst-case deviation between the average values of a collection of representation statistics computed over retrieved items relative to a reference population whose representation we aim to match.” [2]

To address the current representation gap, the researchers developed a metric called Multi-Group Proportional Representation (MPR), which quantifies how well a set of retrieved images represents diverse intersectional groups compared to a reference population. Crucially, MPR can measure representation across a large or even infinite number of overlapping groups, defined by complex combinations of attributes. This allows for a much more nuanced and comprehensive assessment of diversity and representation.

Key Insight: Scalability and Flexibility

“MPR offers a more flexible, scalable, and theoretically grounded metric for multi-group representation in retrieval.” [3]

A key advantage of the MPR approach is its scalability and flexibility. Unlike methods that rely on pre-defined groups, MPR can handle an arbitrary number of intersectional groups defined by complex functions. The researchers provide theoretical guarantees on the sample complexity required to estimate MPR accurately. They also demonstrate how MPR can be efficiently computed for several practical function classes, including linear functions and decision trees. This makes MPR a powerful and adaptable tool for measuring and optimizing representation in large-scale retrieval systems.

Key Insight: Ethical and Practical Considerations

“There are legal and regulatory risks with overreliance on a single metric for fairness, especially if this metric is used to inform policy and decision-making.” [4]

While MPR is a powerful tool, the authors caution against viewing it as a standalone solution. Fairness is multidimensional, and overreliance on a single metric can lead to unintended consequences, such as reinforcing stereotypes or overlooking other forms of harm. Furthermore, the researchers warn that the deployment of MPR by companies could result in “ethics-washing”, where, even when systems exhibit representational harms, firms claim them to be fair based on their use of a fairness metric, like MPR. Finally, to ensure that the results of MPR are diverse, ethical, and fair, the researchers suggest utilizing datasets that are curated to guarantee they represent diverse populations. Failing to do so can result in propagating biases throughout the system.

Why This Matters

For C-suite executives, the introduction of Multi-Group Proportional Representation (MPR) signals a transformative step in aligning artificial intelligence systems with values of fairness and inclusivity. MPR tackles a critical shortfall in current AI practices: the failure to represent diverse intersectional groups in image retrieval and similar applications. By quantifying proportional representation, MPR offers a scalable and actionable framework for mitigating bias while preserving the functionality of the retrieval system.

Adopting MPR isn’t just an ethical responsibility, it’s a strategic imperative. Inclusive AI systems foster trust among consumers and employees, safeguard against reputational and regulatory risks, and enhance decision-making by accurately reflecting the diversity of society. With tools like MPR and the Multi-group Optimized Proportional Retrieval (MOPR) algorithm, organizations can lead in embedding fairness into their technological foundations, transforming inclusivity from a compliance checkbox into a competitive advantage.

References

[1] Alex Osterling, Claudio Mayrink Verdun, Carol Xuan Long, Alexander Glynn, Lucas Monteiro Paes, Sajani Vithana, Martina Cardone, and Flavio du Pin Calmon, “Multi-Group Proportional Representation in Retrieval,” arXiv preprint arXiv:2407.08571 (2024): 1-48, 2.

[2] Alex Osterling et al., “Multi-Group Proportional Representation in Retrieval”, 2.

[3] Alex Osterling et al., “Multi-Group Proportional Representation in Retrieval”, 2.

[4] Alex Osterling et al., “Multi-Group Proportional Representation in Retrieval”, 5.

Meet the Authors

is a PhD student at ��ӽ� under the mentorship of and and is supported by the . They are broadly interested in fair, interpretable, and trustworthy machine learning, and their current projects apply information theoretic tools to problems in fairness and representation learning.

is a mathematician working with mathematics of AI and machine learning at ��ӽ�’s School of Engineering and Applied Sciences under the mentorship of . His research focuses on trustworthy machine learning, exploring concepts such as fairness and arbitrariness, and also on mechanistic interpretability techniques for large generative models.

is a 4th-year Ph.D. student at ��ӽ�, advised by . She completed her undergraduate degree in Math and Computer Science at and previously interned at and . Her research interest lies in Responsible and Trustworthy Machine Learning, and her work spans LLM watermarking, algorithmic fairness, multiplicity, and more.

has a degree in Applied Mathematics from ��ӽ� University, and completed a research fellowship at the ��ӽ� John A. Paulson School of Engineering and Applied Sciences focused on developing software pipelines to analyze and audit algorithms with novel techniques. He currently works as a Data Scientist at C3 AI.

is an Applied Mathematics Ph.D. candidate at working with and a student researcher at in the Gemini Safety Team. He uses theoretical insights to develop safe and trustworthy AI and ML systems. Their research is driven by the belief that AI and ML systems should not only be accurate and efficient but also transparent, fair, and aligned with human values and societal norms.

is a Postdoctoral Research Fellow at the ��ӽ� John A. Paulson School of Engineering and Applied Sciences. Her research interests include information theory, private information retrieval, and machine learning.

works in the Electrical and Computer Engineering Department at the University of Minnesota as an Assistant Professor. From July 2015 to August 2017, she was a post-doctoral research fellow in the Electrical and Computer Engineering Department at UCLA Henry Samueli School. She received her B.Sc. and M.Sc. from Politecnico di Torino in 2009 and 2011, respectively. As part of a Double Degree program, in 2011 she also earned a M.Sc. from Télécom ParisTech – EURECOM. She completed her Ph.D. in Electronics and Communications at EURECOM – Télécom ParisTech.

is an Associate Professor of Electrical Engineering at . Before joining ��ӽ� he was a social good post-doctoral fellow at in Yorktown Heights, New York. He received his Ph.D. in at MIT. His main research interests are information theory, signal processing, and machine learning.

The post Promoting Fair Representation in AI Image Retrieval appeared first on ��ӽ� Business School AI Institute.

Key Lessons from Census III on Open Source Software

HBS AI Content & Learning — Fri, 20 Dec 2024 14:54:46 +0000

With an estimated 96% of codebases incorporating Free and Open Source Software (FOSS), it forms the backbone of modern businesses, driving innovation and reducing costs across industries. However, its decentralized and distributed nature makes assessing its health, economic value, and security a significant challenge. The recently released report, , by , Assistant Professor at ��ӽ� Business School and faculty affiliate at D^3 in the Laboratory for Innovation Sciences at ��ӽ�, and collaborators at ��ӽ� and LINUX (see full list of authors below), provides an in-depth analysis of the OSS landscape, revealing key trends and risks.

The Census III report utilizes a similar methodology to Census II (2022), but with a more comprehensive dataset. It provides eight rank-ordered Top 500 lists of FOSS usage, based on over 12 million 2023 data points from four Software Composition Analysis (SCA) partners. The authors note that: “Operating under data constraints, the findings of this report cannot – and do not purport to – be a definitive claim of which FOSS packages are the most critical.” The report’s findings should be viewed, rather, as the authors’ best estimate of which FOSS application library packages are most widely used. [1]

Key Insight: The Rise of Cloud-Specific FOSS Packages

“The use of cloud service-specific packages is increasing, with high-ranking components that did not rank in Census II.” [2]

As businesses increasingly migrate operations to the cloud, the adoption of FOSS packages tailored to cloud services has surged. For instance, packages like boto3, used for AWS services, and google-cloud-go, used for Google Cloud, ranked among the top FOSS packages. These cloud-specific tools empower firms to streamline operations and innovate quickly in competitive markets. Census III shows that businesses increasingly depend on these packages to address scalability, system integration, and service-specific automation challenges.

Key Insight: Persistent Challenges in Software Version Transitions

“There is an ongoing transition from Python 2 to Python 3, demonstrating the challenges of transitioning to new versions of software with incompatibilities.” [3]

Transitioning to updated software versions remains a significant hurdle for many organizations. The report highlights that even after 15 years, in 2022, 7% of Python developers still used Python 2, which is no longer supported for security updates. They noted, furthermore, that usage remains significantly higher in specific fields, including 23% in DevOps, 24% in computer graphics, and 29% in data analysis. This resistance to upgrade poses critical risks, as legacy software may contain unpatched vulnerabilities. The report also notes that organizations using outdated versions face compounding risks of inefficiency, escalating support costs, and regulatory noncompliance.

Key Insight: The Need for Standardized Naming in FOSS

“There are promising efforts to implement a standardized naming schema for software components which would improve supply chain security and future census efforts.” [4]

A lack of standardized naming for FOSS components creates inefficiencies and security gaps. The Census III report notes how inconsistent naming conventions complicate dependency management and hinder software supply chain transparency. Proposed solutions, such as Package URL (PURL) and cryptographic hashes, offer promising paths forward. These approaches simplify project tracking, enhance collaboration across ecosystems, and mitigate risks in managing cross-platform dependencies critical for operational security.

Key Insight: The Role of Individual Contributors in FOSS Development

“Among top non-npm¹ projects, 17% had only one developer and 40% had one or two developers accounting for more than 80% of commits authored.” [5]

Despite its collaborative nature, much of FOSS development relies heavily on a small number of contributors. The report reveals that, during their review of 47 of the top 50 non-npm projects in 2023, they found that the majority of projects (64%) had four or fewer developers authoring 80% of commits. This concentration of responsibility introduces risks, particularly if maintainers face burnout or leave the project. The report emphasizes that these dependencies raise concerns about sustainability, scalability, and continuity in critical software ecosystems.

Key Insight: The Risks of Legacy Software in FOSS

“Legacy software persists in the open source space, making their security as important as their replacement packages.” [6]

Legacy FOSS software continues to play a significant role, even when better alternatives exist. For example, packages like minimist and request remain widely used despite being deprecated. This reliance on outdated software can introduce security vulnerabilities and reduce operational efficiency. The report highlights how these dependencies often become entrenched due to familiarity, lack of transition resources, and the complexity of updating integrated systems.

Why This Matters

For business leaders, understanding the findings of Census III is crucial for leveraging FOSS effectively while mitigating risks. The report emphasizes that businesses must invest in FOSS to sustain its role as a foundation of modern innovation. To ensure security and future advancements, the authors suggest sharing data on FOSS usage to improve transparency, coordinating efforts to adopt standardized naming and best practices, and investing in critical projects through funding, talent, and time. By embracing these strategies, organizations can support the FOSS ecosystem while strengthening their own resilience and competitive edge in a rapidly evolving digital economy.

Footnotes

(1) Non-npm refers to software packages, libraries, or components that are not managed or distributed through npm (Node Package Manager). npm is a widely used package manager for JavaScript, primarily associated with Node.js applications. Non-npm packages are hosted on alternative package management systems such as Maven (for Java), PyPI (for Python), NuGet (for .NET), Cargo (for Rust), or others tailored to specific programming languages or ecosystems.

References

[1] Frank Nagle, Kate Powell, Richie Zitomer, and David A. Wheeler, Census III of Free and Open Source Software (��ӽ� Business School, Laboratory for Innovation Science at ��ӽ�, and Open Source Security Foundation, December 2024): 1-187, 6.

[2] Nagle et al., Census III, 2.

[3] Nagle et al., Census III, 2.

[4] Nagle et al., Census III, 2.

[5] Nagle et al., Census III, 2.

[6] Nagle et al., Census III, 2.

Meet the Authors

is an Assistant Professor in the Strategy Unit at ��ӽ� Business School and a faculty affiliate of the Digital Data Design Institute and Laboratory for Innovative Science at ��ӽ�. Professor Nagle studies how competitors can collaborate on the creation of core technologies, while still competing on the products and services built on top of them – especially in the context of artificial intelligence. His research falls into the broader categories of the future of work, the economics of IT, and digital transformation and considers how technology is weakening firm boundaries. His work frequently explores the domains of crowdsourcing, free digital goods, cybersecurity, and generating strategic predictions from unstructured big data.

is the Program Manager at the Laboratory for Innovation Science at ��ӽ�. At LISH, she works closely with staff, faculty, and postdoctoral fellows to manage various projects and administrative processes. Before joining LISH, she worked as a Research Coordinator at Tufts University’s Center for Applied Brain and Cognitive Science where she worked with scientists from the U.S. Army’s Combat Capabilities Development Command to test the effects of stress on active duty soldiers . She graduated from ��ӽ� Graduate School of Education with an Ed.M. in Human Development and Psychology.

is a Predoctoral Fellow at ��ӽ� Business School working with the Strategy Unit. Before joining HBS he was a data scientist, most recently at Reddit and Coursera. He received a Master of Data Science from the University of British Columbia and a Bachelor of Arts in Philosophy, Politics & Economics from the University of Pennsylvania.

is an expert on open source software (OSS) and on developing secure software. His works on developing secure software include the Open Source Security Foundation (OpenSSF) Secure Software Development (LFD121) course. He is the Director of Open Source Supply Chain Security at the Linux Foundation and teaches a graduate course in developing secure software at George Mason University (GMU). Dr. Wheeler has a PhD in Information Technology, a Master’s in Computer Science, a certificate in Information Security, a certificate in Software Engineering, and a B.S. in Electronics Engineering, all from George Mason University (GMU).

The post Key Lessons from Census III on Open Source Software appeared first on ��ӽ� Business School AI Institute.

AI-Powered Core Earnings Analysis: A New Frontier in Financial Reporting

HBS AI Content & Learning — Wed, 18 Dec 2024 21:43:22 +0000

Analysis of financial disclosures for publicly traded companies can be a time consuming and costly process these days. In the study, “”, , an Assistant Professor at USC’s Marshall Business School, and , the Tandon Family Professor of Business Administration at ��ӽ� Business School, explore how large language models (LLMs) can be leveraged to estimate core earnings from corporate 10-K filings. Their research demonstrates the potential of AI to revolutionize financial analysis, offering a scalable and cost-effective approach to assessing firms’ persistent profitability.

Key Insight: With Proper Guidance, LLMs Can Effectively Estimate Earnings

“Our results offer empirical support for anecdotal claims that these models can fail when used ‘out of the box,’ on complex tasks without sufficient guidance; but can perform remarkably well when properly guided.” [1]

The researchers tested two approaches: a “lazy analyst” method, with minimal guidance, and a “sequential prompt” strategy that broke down the task into structured steps. Shaffer and Wang found that when given minimal instructions (the lazy analyst), LLMs often made conceptual errors in estimating core earnings. However, when provided with a structured, step-by-step approach (sequential prompt), the models produced valid core earnings measures that outperformed traditional metrics in predicting future earnings.

Key Insight: AI-Generated Core Earnings Measures Show High Persistence and Predictive Power

“[T]he sequential LLM prompt’s core earnings measure and Compustat’s OPEPS emerge as the top performers, with the highest predictive coefficients and R²’s. However, notably, when we extend the prediction horizon to average net income over the next two years, the sequential LLM-based measure surpasses all other measures.” [2]

Shaffer and Wang found that the AI-generated core earnings measure using the sequential-prompt approach better captured the persistent components of earnings that are reflected in market valuations over longer horizons. When predicting stock prices two years ahead, the Sequential Prompt Core Earnings per Share measure achieved an adjusted R²¹ of 0.7585, outperforming both Compustat measures and GAAP net income. This alignment between market valuations and stock prices seen at longer horizons, shows that the information available through the sequential prompt approach could be effectively utilized by investors when determining future stock prices.

Key Insight: LLM-Based Core Earnings Estimates Are Highly Cost-Effective

“The performance of this LLM-based measure — based on an API call costing less than one dollar and one minute of compute time on average per firm — is striking, particularly given the time- and cost-intensive processes associated with the alternatives.” [3]

One significant advantage of the LLM-based approach is its efficiency and cost-effectiveness. The authors note that their method produces core earnings estimates at a fraction of the cost and time required for traditional approaches, which could greatly reduce the expenses associated with processing and analyzing financial disclosures. The authors highlight that the advancements in ChatGPT-4 were pivotal in enabling the completion of an analysis like this one, as its predecessor, ChatGPT-3.5, lacked the necessary analytical capacity. They further suggest that, if the capabilities of LLM models continue to progress as anticipated, leveraging LLMs for such analyses could become a standard practice.

Why This Matters

Shaffer and Wang’s findings highlight the transformative potential of AI in financial analysis, particularly for decision-makers navigating the growing complexity of corporate disclosures. The AI-powered approach to core earnings estimation offers a scalable, cost-effective solution that delivers high-quality insights at a fraction of the time and cost of traditional methods. This innovation could be especially valuable for smaller firms and individual investors who lack access to expensive financial services, leveling the playing field in a data-intensive industry. Moreover, the AI-generated measures’ strong predictive power and correlation with market valuations suggest their utility in investment decisions, strategic planning, and performance evaluation.

However, the study emphasizes the importance of careful implementation when adopting AI tools for financial analysis. The stark difference in performance between the “lazy analyst” and “sequential prompt” approaches underscores the need for well-designed, structured prompts to harness the full potential of these technologies. As AI continues to evolve, it presents an opportunity for business leaders to integrate these tools into their processes, in order to enhance efficiency.

Footnotes

(1) R² (r-squared), also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance in a dependent variable that is explained by an independent variable or variables in a regression model. R² normally ranges from 0 to 1: an R² of 0 indicates that the model does not explain any of the variance in the dependent variable, an R² of 1 means that the model perfectly explains all the variance in the dependent variable. That is to say, generally, the closer the R² arrives to 1, the stronger the relationship between earnings predictions and future actual earnings and market valuation.

References

[1] Matthew Shaffer and Charles C.Y. Wang, “Scaling Core Earnings Measurement with Large Language Models”, (October 8, 2024): 1-45, 6.

[2] Shaffer and Wang, “Scaling Core Earnings Measurement with Large Language Models”, 4.

[3] Shaffer and Wang, “Scaling Core Earnings Measurement with Large Language Models”, 5.

Meet the Authors

is an Assistant Professor at USC’s Marshall Business School. He received his doctoral degree from ��ӽ� Business School and his bachelor’s degree from Yale. His research focuses on valuation and corporate governance, especially valuation practice in institutional settings such as M&A. His work has been published in the Journal of Financial Economics, and presented at leading conferences in accounting, finance, and law.

is the Tandon Family Professor of Business Administration at ��ӽ� Business School. He is a research member of the European Corporate Governance Institute (ECGI) and an associate editor of Management Science and Journal of Accounting Research, two leading management journals. His research and teaching focus on corporate governance and valuation.

The post AI-Powered Core Earnings Analysis: A New Frontier in Financial Reporting appeared first on ��ӽ� Business School AI Institute.

Why Being Together Still Matters for Innovation

HBS AI Content & Learning — Tue, 17 Dec 2024 19:16:51 +0000

Since the COVID-19 pandemic, the rise in remote work practices has inarguably created new business opportunities, such as the ability to hire the most qualified candidate for a position regardless of geographical location. However, research from , Assistant Professor at Purdue University, , Assistant Professor at the University of Michigan School of Information, , Dorothy & Michael Hintze Professor of Business Administration at the ��ӽ� Business School and co-founder and chair of the The Digital Data Design (D^3) Institute at ��ӽ�, and , a Max Palevsky Professor in Sociology at the University of Chicago, highlights the potential downsides of reduced in-person work. In ““, the authors discuss why being physically present—specifically, they look at universities and research institutes—has unique perks that virtual workplaces, so far, can’t replace.

The team used data from Clarivate’s Web of Science (WoS) database to systematically sample scholarly literature across 15 fields in physical sciences, life sciences, social sciences, and humanities, from papers published in 2000, 2005, and 2010. The survey measured influence (the extent to which a referenced paper influenced the citing author’s research choices) and knowledge (the respondent’s familiarity with the referenced paper). The team extracted institutional addresses for papers’ authors, geocoded these addresses using the Google Maps API, and calculated the distance between the institutions of each focal paper and corresponding referenced papers. To analyze intellectual distance, they encoded the title and abstract in a word embedding model using an unsupervised machine learning approach.

Key Insight: A Contradiction to Recent Claims

“Our findings alongside recent scholarship contradict recent commentary in the popular press.” [1]

The research team sets their findings up against recent popular-press articles that make oppositional claims, citing a New York Times piece in particular, which states that there is almost no data to support the productivity of serendipitous in-person encounters. This team says their work provides that missing data, demonstrating that, not only are in-person interactions positively impactful to researchers discovering papers but, also, that they lead to researchers discovering papers that are intellectually distant, which suggests an increased level of innovation and creativity.

Key Insight: Proximity Fuels Influence

“Sharing an institution is a critically important meso-scale for intellectual exposure and influence between the micro-scale of sharing an office, hallway, or department and the macro-scale of sharing a city, state, or country.” [2]

The research discusses the importance of the meso-scale, which they argue matters a great deal in facilitating opportunities for influence. That is to say, while previous studies have considered the importance of co-location on collaborative learning, this study highlights its importance on influence, whether or not individuals collaborate. The research team suggests that universities and research institutions excel at this type of co-locational knowledge transfer because they create opportunities for interactions between individuals from different departments and disciplines, through seminars, committees, shared spaces, and informal gatherings, exposing researchers to ideas they might not otherwise encounter, which in turn creates opportunities to share knowledge and fosters increased diversity among ideas.

Key Insight: Zone of Influence

“[I]f we hope to continue to fuel the engine of innovation, we will need to replace, and not simply displace, this essential but underappreciated mechanism of influence operating within our physical universities.” [3]

Although remote work has been effective, for example, in displacing meetings from the same geographical location to remote settings, it has been less effective in finding ways to replicate the impact that informal interactions have in influencing the spread of ideas between colleagues. The data shows that the casual conversations and serendipitous encounters often lead to big ideas and that informal meso-scale spaces give people a chance to share and, in so doing, stumble upon new insights.

Why This Matters

Are you a business leader considering whether or not, or how often, your team should return to the office? If so, this paper provides crucial insights. Not only does it suggest that a hybrid or in-person model is preferred to a fully remote one, it also suggests that firms should create in-person spaces that encourage cross-disciplinary engagement and informal connections. Regular cross-team meetups, open workspaces, and interdepartmental collaboration can catalyze the in-person interactions that drive innovation. Universities, which facilitate cross-disciplinary researcher interactions, offer a useful model that can aid business leaders in intentionally designing environments for unplanned, influential exchanges to ensure innovation thrives in hybrid work settings.

References

[1] Eamon Duede, Misha Teplitskiy, Karim Lakhani, and James Evans, “Being Together in Place as a Catalyst for Scientific Advance”, Research Policy, Volume 43, Issue 2 (March, 2024): 1-20, 11.

[2] Duede et al., “Being Together in Place as a Catalyst for Scientific Advance”, 11.

[3] Duede et al., “Being Together in Place as a Catalyst for Scientific Advance”, 12.

Meet the Authors

is an Assistant Professor at Purdue University in the Department of Philosophy. Before joining Purdue University, Duede was a Postdoctoral Fellow affiliated with the Digital Data Design Institute at ��ӽ�, and the Embedded EthiCS program in the Philosophy and Computer Science departments.

is an Assistant Professor at the University of Michigan School of Information and the head of DiscoveryLab. His research investigates the role of evaluation/selection methods in innovation, and how knowledge diffuses between scientists in-person and online.

is the Dorothy & Michael Hintze Professor of Business Administration at the ��ӽ� Business School. His innovation-related research is centered around his role as the founder and co-director of the and as the principal investigator of the NASA Tournament Laboratory. He is also the co-founder and chair of the The Digital Data Design (D^3) Institute at ��ӽ� and the co-founder and co-chair of the , a university-wide online program transforming mid-career executives into data-savvy leaders.

is the Director of the , a Fellow in the Computation Institute, and the Co-Director for the Masters in Computational Social Science Program. In addition to his leadership duties, Dr. Evans is a Max Palevsky Professor in Sociology at the University of Chicago with research that focuses on the collective system of thinking and knowing, ranging from the distribution of attention and intuition, the origin of ideas and shared habits of reasoning to processes of agreement (and dispute), accumulation of certainty (and doubt), and the texture—novelty, ambiguity, topology—of human understanding.

The post Why Being Together Still Matters for Innovation appeared first on ��ӽ� Business School AI Institute.

The Interplay of Integration and Delegation in Firm Organization

HBS AI Content & Learning — Thu, 12 Dec 2024 16:07:42 +0000

In today’s complex global economy, some businesses grapple with fundamental decisions regarding their structure and operations. A core challenge lies in determining when to vertically integrate suppliers and when to maintain external contracts. Recent research by ��ӽ� Business School Professor, , and a research team (see the bottom of the page for author details) that includes fellow HBS Professor and co-Principal Investigator of D^3’s Digital Reskilling Lab, , ““, illuminates the interplay between firm integration and the allocation of decision-making authority. By analyzing data from thousands of companies across 20 countries, the study offers new insights into how vertical integration and delegation are driven by value, uncertainty, and strategic flexibility.

Key Insight: A New Vision for Integration

“The “control over control” that comes with ownership helps guarantee the firm a minimum quality and quantity of inputs, and thereby introduces a novel mechanism of supply assurance as a rationale for integration.” [1]

Traditionally, the literature on firm boundaries has focused on how vertical integration helps to gain direct control over production, allowing firms to impose decisions that independent suppliers might resist, but which improve HQ’s productivity. In this study, the researchers expand on the traditional view by emphasizing the “control over control” aspect of integration, where ownership confers the right to reassign control rights, thus enabling HQ not only centralize control within HQ, but also gain more flexibility in delegating decision-making to the party best suited to handle production challenges as they emerge. This dynamic allocation of control is not possible with outsourcing, as non-integrated suppliers maintain their decision-making rights. Therefore, the researchers present a new perspective on supply assurance, where integration provides the option to strategically allocate control rights in response to the changing demands of production.

Key Insight: The Value Principle

“[F]irms delegate more decisions to integrated suppliers that produce more valuable inputs [and …] are more likely to integrate suppliers of more valuable inputs.” [2]

The researchers find that for suppliers that contribute more value to the final product, firms are both more likely to integrate those suppliers and to grant them more autonomy when integrated. Specifically, a one standard deviation increase in input value raises the probability of integration by 64% and increases delegation to integrated suppliers by 0.072 standard deviations. While the delegation standard deviation may appear small in magnitude, the researchers compare it to the effect of another variable, firm size, and find delegation’s relative effect to be noteworthy. Researchers suggest that firms are more likely to delegate to high-value suppliers because they want to leverage the suppliers’ expertise in troubleshooting production issues, and ultimately benefit from this specialized knowledge in their decision making.

Key Insight: What Role Uncertainty Plays

“Firms should be more likely to integrate suppliers that operate in riskier input industries.” [3]

The researchers find that firms are more likely to integrate suppliers in industries with greater productivity dispersion, or variability of labor productivity, among independent suppliers, as the increased uncertainty enhances “option value”. The value option of integration increases in situations where productivity dispersion is high because firms face a greater likelihood of encountering unexpected production challenges. Integration, then, provides a means to mitigate those challenges by allowing firms to either centralize control of the production process or delegate decision-making to the integrated supplier, depending on the nature of the problem.

Key Insight: The Relationship Between Integration and Delegation

“Our data also show that more vertically integrated firms tend to delegate more.” [4]

Contrary to the common assumption that integration and centralization go hand-in-hand, the researchers found a positive correlation between vertical integration and delegation at the firm level. The research team suggests this is likely due to the fact that high value firms are more likely to be integrated, and firms that are integrated are given more autonomy to delegate. This reflects the “value principle” – that input value and firm profitability shape multiple aspects of organizational design, and highlights the importance of studying integration and delegation jointly, rather than in isolation.

Why This Matters

For C-suite executives and business professionals, the insights from this research offer a roadmap for navigating the complexities of modern supply chains. Strategic integration is not just about ownership; it’s about maintaining flexibility and aligning decision-making authority with expertise. In industries facing rapid innovation and uncertainty, these principles are essential for maintaining a competitive advantage. By understanding the interplay between integration and delegation, leaders can make informed decisions that enhance organizational resilience and operational efficiency. In an era where adaptability is key, the ability to balance control with delegation is not just a strategic choice—it’s a business imperative.

References

[1] Laura Alfaro, Nick Bloom, Paola Conconi, Harald Fadinger, Patrick Legros, Andrew F. Newman, Raffaella Sadun, and John Van Reenen, “Come Together: Firm Boundaries and Delegation” Journal of the European Economic Association Volume 22, Issue no. 1 (February 2024): 34–72, 35.

[2] Alfaro et al., “Come Together: Firm Boundaries and Delegation”, 62

[3] Alfaro et al., “Come Together: Firm Boundaries and Delegation”, 61.

[4] Alfaro et al., “Come Together: Firm Boundaries and Delegation”, 63.

Meet the Authors

is the Warren Alpert Professor of Business Administration. At ��ӽ� since 1999, she served as Minister of National Planning and Economic Policy in Costa Rica from 2010-2012, taking a leave from HBS. She is Co-Editor of the Journal of International Economics and the World Bank Research Observer and Vice-President of LACEA, the Latin American and Caribbean Economist Association and a co-Chair of the World Economic Forum’s Global Future Council on the Future of Growth. She is also a Faculty Research Associate in the NBER International Finance and Macroeconomics (IFM) Program and the International Trade and Investment (ITI) Program, CEPR IFM program and co-Chair of the NBER’s Economics of Supply Chains conference, a joint effort with the Department of Homeland Security.

is the William Eberle Professor of Economics at Stanford University, a Senior Fellow of SIEPR, and the Co-Director of the Productivity, Innovation and Entrepreneurship program at the National Bureau of Economic Research. His research focuses on management practices and uncertainty. He previously worked at the UK Treasury and McKinsey & Company. He has a BA from Cambridge, an MPhil from Oxford, and a PhD from University College London.

is a Professor of Economics (Statutory Chair) at the University of Oxford and a Professorial Fellow at New College Oxford. Before joining Oxford, she was a Professor of Economics at the Université Libre de Bruxelles and a member of the European Center for Advanced Research in Economics and Statistics (ECARES). She obtained a B.A. in Political Science from the University of Bologna, a M.A. in International Relations from the School of Advanced International Studies (SAIS) of Johns Hopkins University, and a M.Sc. and Ph.D. in Economics from the University of Warwick.

is a Professor of Economics at the University of Vienna. He is also a Research Fellow in the areas Trade and Regional Economics and Climate Change and the Environment at the Centre for Economic Policy Research (CEPR), London/Paris, a Senior Member of the Mannheim Centre for Competition and Innovation (MaCCI), and a Senior Research Fellow at the Institute for Advanced Studies (IHS) Vienna.

is a Professor of Economics at the Université libre de Bruxelles (ECARES). His research interests are in the theory of contracts, microeconomics, industrial organization, competition policy, and regulation. He has taught courses in intermediate and graduate microeconomics, intermediate and graduate industrial organization and antitrust, graduate courses in contract theory.

is a Professor of Economics at Boston University and a Research Fellow of the CEPR in London. His research and teaching interests are in economic theory, especially as applied to understanding organizations, industries, inequality, and economic development. He is a Fellow of the Econometric Society, a Fellow of BREAD, and a co-founder and past President of ThReD. Before arriving at BU, he was Professor of Economics at University College London and held posts at Columbia and Northwestern. He has also had visiting positions at ��ӽ�, Princeton, the Institute for Advanced Study, Yale, and Brown.

is Charles E. Wilson Professor of Business Administration at ��ӽ� Business School, and is a Co-Chair of ��ӽ� Business School’s Project on Managing the Future of Work and co-PI of the Digital Reskilling Lab. Her research focuses on managerial and organizational drivers of productivity and growth in corporations and the public sector. She co-founded several large-scale projects to measure management practices and managerial behavior in organizations, such as the World Management Survey, the Executive Time Use Study, and the first large scale management survey in hospitals, MOPS-H, conducted in partnership with the US Census Bureau.

is the Ronald Coase Chair in Economics and a Professor in the Department of Economics at the London School of Economics and Political Science. His research expertise is in applied microeconomics with a focus on the causes and consequences of technological and managerial innovation especially with regards to the labor market.

The post The Interplay of Integration and Delegation in Firm Organization appeared first on ��ӽ� Business School AI Institute.

Performance and Metrics Archives | ���ӽ� Business School AI Institute

Smarter Memories, Stronger Agents: How Selective Recall Boosts LLM Performance

Key Insight: Accelerate or Anchor?

Key Insight: Selective Addition

Key Insight: Improvement through Deletion

Why This Matters

References

Meet the Authors

Teaching Trust: How Small AI Models Can Make Larger Systems More Reliable

Key Insight: The Three Pillars of AI Trustworthiness

Key Insight: Utilizing Novel Fine-Tuning Strategies

Key Insight: Challenges in Privacy Transfer

Why This Matters:

Footnote

References

Meet the Authors

The Future of Decision-Making: How Generative AI Transforms Innovation Evaluation

Key Insight: AI-Augmented Decisions Are More Stringent

Key Insight: Balancing Objectivity and Subjectivity in AI Collaboration

Key Insight: The Rise of AI Interaction Expertise

Why This Matters

References

Meet the Authors

Bridging the Gap Between Understanding and Control: Insights into AI Interpretability

Key Insight: Intervention as a Fundamental Goal of Interpretability

Key Insight: A Unified Framework for Interpretability and Control

Key Insight: Measuring Success Through Interventions

Why This Matters

Footnotes

References

Meet the Authors

The Promise and Pitfalls of AI in Strategic Decision-Making

Key Insight: AI Bias and Inconsistency

Key Insight: Aggregating AI Evaluations Improves Accuracy

Key Insight: Diversity and Scale Both Contribute to Improved AI Performance

Why This Matters

Footnotes

References

Meet the Authors

Promoting Fair Representation in AI Image Retrieval

Key Insight: Current Approaches Fall Short on Intersectional Representation

Key Insight: A New Metric for Multi-Group Representation

Key Insight: Scalability and Flexibility

Key Insight: Ethical and Practical Considerations

Why This Matters

References

Meet the Authors

Key Lessons from Census III on Open Source Software

Key Insight: The Rise of Cloud-Specific FOSS Packages

Key Insight: Persistent Challenges in Software Version Transitions

Key Insight: The Need for Standardized Naming in FOSS

Key Insight: The Role of Individual Contributors in FOSS Development

Key Insight: The Risks of Legacy Software in FOSS

Why This Matters

Footnotes

References

Meet the Authors

AI-Powered Core Earnings Analysis: A New Frontier in Financial Reporting

Key Insight: With Proper Guidance, LLMs Can Effectively Estimate Earnings

Key Insight: AI-Generated Core Earnings Measures Show High Persistence and Predictive Power

Key Insight: LLM-Based Core Earnings Estimates Are Highly Cost-Effective

Why This Matters

Footnotes

References

Meet the Authors

Why Being Together Still Matters for Innovation

Key Insight: A Contradiction to Recent Claims

Key Insight: Proximity Fuels Influence

Key Insight: Zone of Influence

Why This Matters

References

Meet the Authors

The Interplay of Integration and Delegation in Firm Organization

Key Insight: A New Vision for Integration

Key Insight: The Value Principle

Key Insight: What Role Uncertainty Plays

Key Insight: The Relationship Between Integration and Delegation

Why This Matters

References

Meet the Authors

Performance and Metrics Archives | ��ӽ� Business School AI Institute