Warning
DEPRECATED: Koalas supports Apache Spark 3.1 and below. As of Apache Spark 3.2, Koalas is officially included in PySpark, and this repository is now in maintenance mode. For Apache Spark 3.2 and above, please use PySpark directly.
Koalas: pandas API on Apache Spark
The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark. pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:
- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
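As a minimal sketch of the "single codebase" idea, the function below uses only pandas-style DataFrame calls, so the same code could be handed either a pandas DataFrame or, assuming Koalas and a Spark runtime are installed, a Koalas DataFrame. The function name `mean_value_by_key` and the column names `key` and `value` are hypothetical, chosen only for illustration. The Koalas lines are left as comments so the block runs with pandas alone.

```python
import pandas as pd


def mean_value_by_key(df):
    """Return the mean of `value` per `key` using only pandas-style calls."""
    # groupby, mean and sort_index are available in both pandas and Koalas.
    return df.groupby("key")["value"].mean().sort_index()


# Small, in-memory data (tests, smaller datasets): plain pandas.
pdf = pd.DataFrame(
    {"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]}
)
result = mean_value_by_key(pdf)
print(result)

# Distributed data (assumption: Koalas and Spark are available):
# import databricks.koalas as ks
# kdf = ks.from_pandas(pdf)                      # Koalas DataFrame backed by Spark
# print(mean_value_by_key(kdf).to_pandas())      # same function, unchanged
```

On Apache Spark 3.2 and above, the same pattern should apply with `import pyspark.pandas as ps` in place of the Koalas import.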
We would love to have you try it and give us feedback through our mailing lists or GitHub issues. Try the Koalas 10 minutes tutorial on a live Jupyter notebook here. The initial launch can take several minutes.
- Getting started
- User Guide
- Options and settings
- Working with pandas and PySpark
- Transform and apply a function
- Type Support In Koalas
- Type Hints In Koalas
- From/to other DBMSes
- Best Practices
- Leverage PySpark APIs
- Check execution plans
- Use checkpoint
- Avoid shuffling
- Avoid computation on single partition
- Avoid reserved column names
- Do not use duplicated column names
- Specify the index column in conversion from Spark DataFrame to Koalas DataFrame
- Use distributed or distributed-sequence default index
- Reduce the operations on different DataFrame/Series
- Use Koalas APIs directly whenever possible
- FAQ
- What’s the project’s status?
- Is it Koalas or koalas?
- Should I use PySpark’s DataFrame API or Koalas?
- Does Koalas support Structured Streaming?
- How can I request support for a method?
- How is Koalas different from Dask?
- How can I contribute to Koalas?
- Why a new project (instead of putting this in Apache Spark itself)?
- API Reference
- Input/Output
- General functions
- Series
- Constructor
- Attributes
- Conversion
- Indexing, iteration
- Binary operator functions
- Function application, GroupBy & Window
- Computations / Descriptive Stats
- Reindexing / Selection / Label manipulation
- Missing data handling
- Reshaping, sorting, transposing
- Combining / joining / merging
- Time series-related
- Spark-related
- Accessors
- Date Time Handling
- String Handling
- Categorical accessor
- Plotting
- Serialization / IO / Conversion
- Koalas-specific
- DataFrame
- Constructor
- Attributes and underlying data
- Conversion
- Indexing, iteration
- Binary operator functions
- Function application, GroupBy & Window
- Computations / Descriptive Stats
- Reindexing / Selection / Label manipulation
- Missing data handling
- Reshaping, sorting, transposing
- Combining / joining / merging
- Time series-related
- Serialization / IO / Conversion
- Spark-related
- Plotting
- Koalas-specific
- Index objects
- Window
- GroupBy
- Machine Learning utilities
- Extensions
- Development
- Contributing Guide
- Design Principles
- Be Pythonic
- Unify small data (pandas) API and big data (Spark) API, but pandas first
- Return Koalas data structure for big data, and pandas data structure for small data
- Provide discoverable APIs for common data science tasks
- Provide well documented APIs, with examples
- Guardrails to prevent users from shooting themselves in the foot
- Be a lean API layer and move fast
- High test coverage
- Release Notes
- Version 1.8.2
- Version 1.8.1
- Version 1.8.0
- Version 1.7.0
- Version 1.6.0
- Version 1.5.0
- Version 1.4.0
- Version 1.3.0
- Version 1.2.0
- Version 1.1.0
- Version 1.0.1
- Version 1.0.0
- Version 0.33.0
- Version 0.32.0
- Version 0.31.0
- Version 0.30.0
- Version 0.29.0
- Version 0.28.0
- Version 0.27.0
- Version 0.26.0
- Version 0.25.0
- Version 0.24.0
- Version 0.23.0
- Version 0.22.0
- Version 0.21.0
- Version 0.20.0
- Version 0.19.0
- Version 0.18.0
- Version 0.17.0
- Version 0.16.0