I kept hearing people argue about what strategy is and isn't—whether the field is fracturing, whether causal methods are pushing young scholars toward unimportant questions, and so on. What frustrated me was that data was almost never brought to bear in these debates. So I decided to try to answer the questions empirically myself.
I gave myself 48 hours and Claude Code. This is what I came up with: a full paper and this website. I've made some updates since that initial 48 hours, but the site is still 100% built with Claude Code.
Here is a presentation I gave on the project.
Defining what counts as “strategy research” is the most difficult part of this project. Scholars disagree about the boundaries of the field, and there's no universally accepted definition. So I need to be upfront about how I made this judgment call.
My starting assumption is simple: if a paper is published in the Strategic Management Journal, it's a strategy paper. SMJ is a strategy-only journal, so everything in it counts as strategy by definition. That's my anchor.
I also collected papers from four other top management journals: Academy of Management Journal, Organization Science, Management Science, and Administrative Science Quarterly. But not every paper in these journals is about strategy—they cover organizational behavior, HR, operations, finance, and more.
The risk of anchoring only on SMJ is that some strategy conversations never make it into SMJ's pages. Even if a conversation happens primarily in Organization Science or ASQ, it should be included here as long as it's semantically similar to what's in SMJ. This is inherently a judgment call.
To filter for strategy papers, I embedded every paper in semantic space using OpenAI's text-embedding-3-large model and computed how similar each non-SMJ paper is to the SMJ corpus. Papers that are semantically close to SMJ research are classified as strategy; papers that aren't are excluded.
To give you a sense of how this works in practice, here are examples of papers right around the similarity threshold:
Included as strategy:
Excluded (just below threshold):
The difference between these groups is subtle—they're separated by a hair's width in semantic space. Reasonable people could disagree about any individual paper near this boundary.
Clearly strategy (high similarity to SMJ):
Clearly not strategy (low similarity to SMJ):
The dataset includes research from five leading journals, but not every paper from every journal was included. For non-SMJ journals, there were typically two stages of filtering: first during PDF collection (to avoid downloading thousands of irrelevant papers), and then again after embedding using semantic similarity to SMJ.
Strategic Management Journal (SMJ)
2,954 papers from 1980–2024. All research articles included—SMJ is strategy-only, so everything published there counts as strategy by definition. No filtering applied.
Academy of Management Journal (AMJ)
1,286 strategy papers from 1980–2024. AMJ covers multiple fields (OB, HR, strategy).
Stage 1: Before downloading PDFs, I used an LLM (GPT-4o-mini) to classify each paper based on its title as either “strategy/OT/entrepreneurship” or “OB/other.” Only strategy-relevant papers were downloaded.
Stage 2: After embedding, papers were filtered again using semantic similarity to SMJ. About 78% of the pre-filtered papers passed this second stage.
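For concreteness, the Stage 1 title screen might be sketched as below. Only the two labels and the model name come from the description above; the prompt wording, function names, and the "default to OB/other when ambiguous" rule are my assumptions.

```python
# Stage 1 title screen, sketched. Prompt wording and fallback behavior
# are assumptions; only the labels and model name are from the pipeline.
STRATEGY = "strategy/OT/entrepreneurship"
OTHER = "OB/other"

def build_prompt(title: str) -> str:
    return (
        "Classify this management-journal article from its title alone.\n"
        f"Title: {title}\n"
        f'Answer with exactly one label: "{STRATEGY}" or "{OTHER}".'
    )

def is_strategy_reply(reply: str) -> bool:
    # Treat anything that isn't a clear strategy label as OB/other,
    # so ambiguous replies default to "don't download".
    return STRATEGY.lower() in reply.lower()

# The reply itself would come from a chat-completions call, e.g.:
# client.chat.completions.create(model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(title)}])
```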
Organization Science (OrgSci)
1,138 strategy papers from 1990–2024.
Stage 1: All papers downloaded (no pre-filtering).
Stage 2: Filtered using semantic similarity to SMJ.
Management Science (MS)
846 strategy papers from 1954–2024. MS covers operations, finance, marketing, and more—only about 30% of the journal is potentially relevant to strategy.
Stage 1: Papers were pre-filtered during collection using title keywords (strategy, competitive advantage, acquisition, alliance, governance, innovation, etc.).
Stage 2: After embedding, papers were filtered again using semantic similarity to SMJ.
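As a rough illustration of that keyword screen (only the keywords named above are shown; the actual list is longer, per the "etc."):

```python
import re

# Illustrative Stage 1 keyword screen for Management Science titles.
KEYWORDS = [
    "strategy", "competitive advantage", "acquisition",
    "alliance", "governance", "innovation",
]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def passes_prefilter(title: str) -> bool:
    # Download the PDF only if the title mentions a strategy keyword.
    return PATTERN.search(title) is not None
```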
Administrative Science Quarterly (ASQ)
326 strategy papers from 2000–2024. ASQ archives before 2000 were not available from the publisher, so earlier years are missing.
Stage 1: All papers downloaded (no pre-filtering).
Stage 2: Filtered using semantic similarity to SMJ.
Papers were collected as PDFs and processed through a custom extraction pipeline. Not every PDF could be processed successfully—some older papers had poor scan quality, unusual formatting, or missing abstracts. The final dataset includes papers where both title and abstract could be reliably extracted.
Each PDF is processed using PyMuPDF to extract raw text and metadata. A custom section splitter identifies key sections (abstract, introduction, methods, results, discussion, references) using pattern matching tuned for academic paper formats across different eras—journal layouts changed significantly from the 1980s to today.
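A stripped-down sketch of such a splitter. The heading patterns here are illustrative stand-ins, not the pipeline's actual ones (which handle many more era-specific layouts), and the raw text would come from PyMuPDF's text extraction.

```python
import re

# Minimal section splitter sketch: find each heading, then take the text
# between it and the next heading. Patterns are simplified assumptions.
SECTION_HEADINGS = {
    "abstract": r"^\s*ABSTRACT\b",
    "introduction": r"^\s*(1\.?\s+)?INTRODUCTION\b",
    "methods": r"^\s*(METHODS?|METHODOLOGY|DATA AND METHODS?)\b",
    "results": r"^\s*RESULTS\b",
    "discussion": r"^\s*DISCUSSION\b",
    "references": r"^\s*REFERENCES\b",
}

def split_sections(text: str) -> dict:
    """Map section name -> text between that heading and the next one."""
    hits = []
    for name, pat in SECTION_HEADINGS.items():
        m = re.search(pat, text, re.MULTILINE | re.IGNORECASE)
        if m:
            hits.append((m.start(), m.end(), name))
    hits.sort()
    sections = {}
    for i, (start, end, name) in enumerate(hits):
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections[name] = text[end:stop].strip()
    return sections
```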
For each paper, I extract two text blocks: an abstract block (title + abstract) capturing what the paper is about, and a methods block (methodology section) capturing how the research was conducted.
Non-research content is excluded. The pipeline filters out editorials, book reviews, calls for papers, errata, retractions, editorial board listings, tables of contents, mastheads, author indexes, special issue introductions, and other front/back matter. Only research articles are included in the final dataset.
Before embedding, text is cleaned to remove citation markers (e.g., [12], Smith et al. 2020), URLs, DOIs, and email addresses. Text is truncated at 8,000 tokens if necessary to stay within model limits.
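The cleanup step might look like the sketch below. The regex patterns are simplified stand-ins for the pipeline's actual ones, and the character-based cutoff is a crude proxy for real token counting (which would use a tokenizer such as tiktoken).

```python
import re

# Simplified pre-embedding cleanup: strip citation markers, URLs, DOIs,
# and email addresses, then truncate. Patterns are illustrative.
PATTERNS = [
    r"\[\d+(?:,\s*\d+)*\]",                        # numeric markers like [12]
    r"\(?\b[A-Z][A-Za-z'-]+ et al\.,? \d{4}\)?",   # Smith et al. 2020
    r"https?://\S+",                               # URLs
    r"\b10\.\d{4,9}/\S+",                          # DOIs
    r"\S+@\S+\.\S+",                               # email addresses
]

def clean_text(text: str, max_tokens: int = 8000) -> str:
    for pat in PATTERNS:
        text = re.sub(pat, " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Crude truncation: the real pipeline counts tokens; ~4 chars/token
    # is a rough rule of thumb for English text.
    return text[: max_tokens * 4]
```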
Each text block is converted into a high-dimensional vector using OpenAI's text-embedding-3-large model, which produces 3,072-dimensional embeddings. These vectors capture semantic meaning—papers about similar topics end up close together in this high-dimensional space, even if they use different terminology.
The abstract embeddings are used for topic clustering and visualization. The methods embeddings power the methodology analysis and credibility revolution tracking.
Papers from AMJ, OrgSci, Management Science, and ASQ are filtered based on their semantic similarity to the SMJ corpus. For each non-SMJ paper, I compute its cosine similarity to every SMJ paper, then take the average similarity to the 10 nearest SMJ papers as its “strategy score.”
Papers with a strategy score of 0.50 or higher are classified as strategy research and included in the dataset. This threshold was chosen to balance inclusivity with precision—lower thresholds included too many clearly non-strategy papers; higher thresholds excluded papers that seemed reasonably strategy-relevant.
The k-nearest approach (rather than comparing to a single centroid) helps capture the full diversity of SMJ content. A paper doesn't need to be similar to “average” strategy research—it just needs to be similar to some substantial cluster of strategy papers.
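In miniature, the score works as follows; 2-d vectors stand in for the 3,072-d embeddings, and the function names are mine.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def strategy_score(paper_vec, smj_vecs, k=10):
    """Mean cosine similarity to the k nearest SMJ papers."""
    sims = sorted((cosine(paper_vec, s) for s in smj_vecs), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def is_strategy(paper_vec, smj_vecs, threshold=0.50):
    # Papers at or above the 0.50 threshold enter the dataset.
    return strategy_score(paper_vec, smj_vecs) >= threshold
```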
To visualize 3,072-dimensional embeddings, I use UMAP (Uniform Manifold Approximation and Projection) to reduce them to 2D and 3D coordinates. UMAP preserves both local structure (similar papers stay close) and global structure (distinct research areas remain separated). The parameters (n_neighbors=15, min_dist=0.1, cosine metric) were tuned to balance cluster separation with faithful representation of the semantic landscape.
I apply two levels of clustering to identify research themes:
Clustering is performed on the full 3,072-dimensional embeddings (not the reduced UMAP coordinates) to preserve maximum information.
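The write-up doesn't name the clustering algorithm, so the following is only a minimal k-means-style sketch operating on full-dimensional vectors; "two levels" would then mean running it once at a coarse k and once at a fine k.

```python
import random

# Minimal k-means sketch (an assumption; the actual algorithm used in
# the pipeline is not specified). Points are tuples of floats standing
# in for 3,072-d embedding vectors.
def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Recompute centroids; keep the old one if a group went empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids, groups
```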
Each cluster is labeled using GPT-4. I identify the 5 papers closest to each cluster's centroid (the most representative papers), send their abstracts to the model, and ask it to identify the common theoretical framework and methodology, then provide a concise 3–5 word label. This produces human-readable names like “Dynamic Capabilities & Adaptation” or “CEO Succession & Governance” that describe what each cluster is actually about.
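The selection step above amounts to ranking a cluster's papers by similarity to the cluster centroid and keeping the top five. A small sketch (dimensions and function names are illustrative; the GPT-4 prompt itself is omitted):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def representatives(cluster_vecs, n=5):
    """Indices of the n papers nearest the cluster centroid."""
    dim = len(cluster_vecs[0])
    centroid = [sum(v[i] for v in cluster_vecs) / len(cluster_vecs)
                for i in range(dim)]
    ranked = sorted(range(len(cluster_vecs)),
                    key=lambda i: -cosine(cluster_vecs[i], centroid))
    return ranked[:n]  # abstracts of these papers go to the labeling model
```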
Beyond clustering, I use LLMs to extract structured variables from each paper:
This project uses both semantic embeddings and citation analysis—for different purposes.
Semantic embeddings capture the content of papers—what topics they address, what concepts they use. They're used here to define the boundaries of strategy research (similarity to SMJ) and to cluster papers into topics. A paper's position in the semantic landscape reflects what it's about, regardless of what it cites or who cites it.
Citation data captures influence—which papers build on which other papers. It's used here to trace knowledge flows between research areas, identify bridge papers, and rank topics by influence. The Citations page shows how different research clusters cite each other, revealing the intellectual structure of the field.
Traditional field mapping often relies solely on citations (co-citation analysis, bibliographic coupling). This approach has limitations:
Embeddings have their own limitations:
Use embeddings to map what the field studies—the content landscape. Use citations to map how ideas flow—the influence network. Together, they provide a richer picture than either alone.
This project has several limitations worth noting: