Python
This part introduces Python, an open-source, high-level programming language that has become indispensable for empirical finance. I aim to provide readers with a foundational understanding of Python’s capabilities and how to leverage it for financial research. Starting from the basics, we will explore why Python is the go-to tool for analysts, researchers, and data scientists in finance.
What is Python?
Python is a versatile and user-friendly programming language that emphasizes readability and simplicity. Its design philosophy promotes code that is easy to write and understand, making it an excellent choice for both beginners and experienced programmers. The name Python also refers to the interpreter, the program that runs Python code.
Why Python for Finance?
Python has been widely adopted by the finance industry and finance researchers for several compelling reasons:
Open source
Python is free and open-source. Yes, that means you can use it for free, but open-source is much more than that. Python is distributed under the Python Software Foundation License, a permissive license that allows you to use, modify, and distribute the code and derived works based on it. This has led to a vibrant ecosystem of libraries and tools that are freely available to all, and to private forks of Python that are used internally by large financial institutions, known collectively as bank Python.
Ease of learning
Python’s syntax is designed to be intuitive and human-readable, making it an accessible language for beginners and experts alike. This simplicity allows finance professionals—many of whom may not have a computer science background—to quickly learn Python and apply it to their work. Python code often reads like plain English, reducing the learning curve and enabling users to focus on problem-solving rather than struggling with the syntax.
For academic researchers, this ease of learning means that Python can be introduced in undergraduate or graduate programs with minimal friction. Students can rapidly transition from learning the basics of the language to applying it in real-world financial scenarios, such as data analysis, statistical modeling, and portfolio optimization.
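To give a sense of what "reads like plain English" means in practice, here is a minimal sketch that computes simple returns from a list of prices. The prices and the function name `simple_returns` are made up for illustration.

```python
# Hypothetical closing prices, for illustration only.
prices = [100.0, 102.0, 99.96, 104.958]


def simple_returns(prices):
    """Return the period-over-period simple returns of a price series."""
    returns = []
    # Pair each price with the one that follows it.
    for previous, current in zip(prices, prices[1:]):
        returns.append(current / previous - 1)
    return returns


returns = simple_returns(prices)
for r in returns:
    print(f"{r:.2%}")  # format each return as a percentage
```

Even without knowing the language, most readers can guess what the loop does: for each pair of consecutive prices, divide the current price by the previous one and subtract one.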
Powerful
Python is exceptionally powerful due to its extensive library ecosystem. Libraries like NumPy, SciPy, and statsmodels provide robust tools for numerical and statistical computations, while pandas and polars facilitate data manipulation and analysis. These capabilities make Python an ideal choice for tasks ranging from simple data cleaning to complex econometric modeling.
In addition to its computational capabilities, Python integrates seamlessly with other programming languages and platforms, enabling finance practitioners to incorporate Python into larger, multi-language workflows. For example, it can call high-performance code written in C++ or Rust, interact with databases through SQL, or interface with scientific languages like R or Julia. This power and flexibility ensure that Python remains suitable for both small-scale analyses and enterprise-level financial systems.
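As a small illustration of the SQL integration mentioned above, the sketch below uses `sqlite3`, a database module that ships with Python. The `trades` table and its contents are hypothetical. A production system would likely connect to a different database, but the pattern of sending a query and getting the result back as Python objects would be similar.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, quantity INTEGER, price REAL)")

# Hypothetical trades, for illustration only.
trades = [("AAA", 10, 50.0), ("BBB", 5, 20.0), ("AAA", -4, 55.0)]
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", trades)

# Let the database do the aggregation, then bring the result back into Python.
query = """
    SELECT ticker, SUM(quantity) AS net_quantity
    FROM trades
    GROUP BY ticker
    ORDER BY ticker
"""
positions = dict(conn.execute(query).fetchall())
conn.close()

print(positions)
```

Here SQL handles the grouping and summing, while Python receives the net position per ticker as an ordinary dictionary.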
Versatile
Python’s versatility allows it to handle a wide range of tasks, making it a one-stop solution for financial workflows. Analysts can use Python for tasks such as data acquisition from APIs or through web scraping, performing statistical analyses, creating visualizations, and even building predictive models using machine learning libraries like scikit-learn.
Widely used
In 2024, Python surpassed JavaScript to become the most widely used programming language on GitHub, according to GitHub’s Octoverse report. This rise in use is attributed to the growing importance of AI and data science, for which Python is the most popular language.
This popularity ensures that Python skills are highly transferable and in demand, making it a valuable asset for finance professionals. Large financial institutions, such as JPMorgan Chase and Goldman Sachs, use Python extensively for data analysis, trading algorithms, and risk modeling. Financial data providers such as LSEG (formerly Refinitiv and Thomson Reuters) and WRDS, an academic data provider, offer Python-based platforms for accessing and analyzing financial data. Python skills are now a must-have for finance professionals, so much so that the CFA Institute has added Python practical skills to its 2024 curriculum.
Other languages
Python is not the only game in town. R and Stata offer better capabilities for econometric and statistical modeling, and are also widely used in academic research. Julia and MATLAB offer better performance for numerical computing, and C++ and Rust are the languages of choice for performance-critical parts of financial applications. Finally, SAS and many database systems use a variant of SQL, while others use their own proprietary language, such as q for kdb+, a staple of high-frequency trading firms. Overall, each language has its strengths and weaknesses, and the choice of language depends on the specific task at hand, but Python is a very good choice for most tasks.
Components of the Python Ecosystem
The Python ecosystem consists of various tools and components that make it a powerful platform for data analysis. This section introduces the key elements of the ecosystem and their roles in creating efficient workflows.
Python interpreter
Python is an interpreted language, meaning that your code is executed by an interpreter when you run it rather than compiled ahead of time to an executable. The Python interpreter is responsible for executing your Python code. It comes in various implementations. The most common is CPython, the default implementation distributed with official Python releases, which is the one we will use in this book. Other variants include PyPy, a just-in-time (JIT) compiled interpreter that enhances performance for specific tasks, and Pyodide, a port of CPython to WebAssembly that allows Python to run in web browsers.
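If you are unsure which interpreter is running your code, you can ask it directly. The following sketch uses only the standard library. The exact output depends on your installation.

```python
import platform
import sys

# Name of the implementation running this code, e.g. "CPython" or "PyPy".
implementation = platform.python_implementation()
version = sys.version_info

print(f"Implementation: {implementation}")
print(f"Version: {version.major}.{version.minor}.{version.micro}")
print(f"Interpreter path: {sys.executable}")
```

The interpreter path is particularly useful once you work with several projects, since each one may use its own copy of the interpreter.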
Python libraries
The base Python language is very simple, but it is extended through a large ecosystem of libraries. Python comes with a large standard library that includes many features such as file input/output, basic data structures, and mathematical functions. However, most Python programs leverage additional libraries, pre-written modules that extend Python’s functionality. For empirical finance, some key libraries include:
- pandas and polars: For data manipulation and analysis.
- NumPy and SciPy: For numerical computations.
- matplotlib and seaborn: For data visualization.
- statsmodels and linearmodels: For econometric modeling.
- scikit-learn: For machine learning.
These libraries form the backbone of financial analysis in Python, enabling everything from basic calculations to complex statistical modeling.
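Before installing any of these libraries, it is worth seeing how far the standard library alone can go. The sketch below reads a small CSV table and computes the mean and sample standard deviation of a return series using only built-in modules. The monthly returns are hypothetical and embedded in the code so the example is self-contained. In practice, you would typically reach for pandas or polars for this kind of work.

```python
import csv
import io
import statistics

# Hypothetical monthly returns, embedded as CSV text for illustration only.
raw = """month,ret
2024-01,0.01
2024-02,-0.02
2024-03,0.03
2024-04,0.02
"""

# Parse the CSV text into a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(raw)))
returns = [float(row["ret"]) for row in rows]

mean_return = statistics.mean(returns)
volatility = statistics.stdev(returns)  # sample standard deviation

print(f"Mean: {mean_return:.4f}, volatility: {volatility:.4f}")
```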
Python libraries are published on PyPI, the Python Package Index. Anyone can publish a library to PyPI, and libraries contain code that will be executed on your computer, so they can contain malware. Well-known libraries are less likely to contain vulnerabilities, but you should always check a library’s documentation and assess its reputation before using it. We will discuss security best practices in more detail in future chapters.
Environment and package management
Python versions are updated frequently.[1] Libraries follow their own release cycles and are not always updated at the same time as the Python interpreter. Some libraries depend on specific versions of other libraries, so a chain of dependencies may need to be updated. To complicate matters, updates to libraries may break your code. This means that code that worked last month may not work this month if you always use the latest version of everything.
To minimize these issues, the best practice is to create a virtual environment for each project. This ensures that the versions of the Python interpreter and libraries are fixed and consistent for each project, and that you can easily update the libraries for a project without affecting other projects.
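A first step toward reproducibility is simply knowing what is installed. The sketch below checks whether the code is running inside a virtual environment and reports the installed version of a few packages. The helper name `installed_version` and the list of package names are illustrative assumptions. Substitute your project's actual dependencies.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Inside a virtual environment, sys.prefix differs from sys.base_prefix.
in_virtual_env = sys.prefix != sys.base_prefix

# Package names chosen for illustration; replace with your own dependencies.
for package in ["pandas", "numpy", "statsmodels"]:
    print(package, installed_version(package))

print("Inside a virtual environment:", in_virtual_env)
```

Running this in two different projects will often show different versions of the same library, which is exactly the situation virtual environments are designed to handle.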
There are several tools for managing Python environments. Python comes with venv, a built-in module for creating virtual environments, and pip, a package manager for installing and updating libraries. A popular alternative for data science projects is conda, part of the Anaconda distribution.[2] Another popular tool, which I was using until recently, is Poetry. Both conda and Poetry act as environment and package managers.
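For reference, here is a minimal sketch of what the built-in venv module does. It is usually invoked from the command line with `python -m venv`, but it can also be driven from Python itself, which makes it easy to see what a virtual environment is: a directory with its own configuration file and interpreter. The environment is created in a temporary directory and without pip, both choices made only to keep the example quick and free of side effects.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / ".venv"

    # with_pip=False keeps the example fast; a real project would usually want pip.
    venv.create(env_dir, with_pip=False)

    # A virtual environment has a pyvenv.cfg file at its root.
    config_exists = (env_dir / "pyvenv.cfg").exists()
    print("Environment created:", config_exists)
```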
In this book, we will use uv, a tool that replaces venv, pip, and a suite of other tools. It brings a lot of modern features to the table, such as a lockfile to ensure reproducibility and the ability to install and manage Python versions. But for me, its main advantage is its incredible speed, which is due in part to an advanced caching mechanism, the use of hardlinks to avoid keeping multiple copies of each library, and a very fast dependency resolver.[3]
Integrated Development Environments (IDEs)
IDEs streamline coding by providing features such as syntax highlighting, debugging tools, and other conveniences that make your life easier. In this book, we will use Visual Studio Code (VS Code), an open-source code editor by Microsoft that is very popular in the data science community. It is not Python-specific, but it integrates seamlessly with Python through extensions. Because it is open-source, many forks have been created, such as Cursor, which adds a powerful AI engine to the editor, and Positron (still in beta at the time of writing), a data science-oriented fork by Posit, the company behind RStudio.
Other popular IDEs include PyCharm by JetBrains (commercial, but free for students and academics) and Neovim, a terminal-based text editor that is very popular among developers for its extensibility. Finally, Spyder is an open-source IDE that was very popular in the Python scientific community, but has since been eclipsed by VS Code.
Notebooks
Notebooks, such as Jupyter Notebooks, are interactive environments where code, text, and visualizations can coexist. They have significant drawbacks for robust, replicable research, but are nonetheless very popular in data science. VS Code supports Jupyter notebooks through an extension. We will cover them in more detail in this book, along with their shortcomings and ways to mitigate them.
marimo is a new Python notebook interface that aims to address some of the issues with Jupyter Notebooks. While it has a long way to go to overtake Jupyter in the data science community, it shows a lot of promise.
[1] The latest version at the time of writing is 3.13. Since Python 3.8 in 2019, a new version has been released every October.
[2] While conda is open-source, by default it uses the Anaconda Repository, which requires a paid subscription under some conditions. At the time of writing, it is free for academic use.
[3] The main task of a package manager is to resolve dependencies, i.e., to figure out which versions of libraries need to be installed to satisfy the dependencies of a project. This is a very complex task (NP-hard) that requires a lot of computation and heuristics.