Polars Core Concepts
Expressions
Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.
What are Expressions?
An expression describes a transformation on data. It only materializes (executes) within specific contexts:
- select() - Select and transform columns
- with_columns() - Add or modify columns
- filter() - Filter rows
- group_by().agg() - Aggregate data
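An expression by itself is just a description of work; nothing is computed until it is used in one of these contexts. A minimal sketch (the column name and data are illustrative):
import polars as pl
df = pl.DataFrame({"age": [20, 30, 40]})
# Building the expression does not touch any data
age_doubled = (pl.col("age") * 2).alias("age_doubled")
print(type(age_doubled))  # an Expr object, not a computed result
# The expression only materializes inside a context such as select()
print(df.select(age_doubled))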
Expression Syntax
Basic column reference:
pl.col("column_name")
Computed expressions:
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")
# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")
# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)
Expression Contexts
Select context:
df.select(
"name", # Simple column name
pl.col("age"), # Expression
(pl.col("age") * 12).alias("age_in_months") # Computed expression
)
With_columns context:
df.with_columns(
age_doubled=pl.col("age") * 2,
name_upper=pl.col("name").str.to_uppercase()
)
Filter context:
df.filter(
pl.col("age") > 25,
pl.col("city").is_in(["NY", "LA", "SF"])
)
Group_by context:
df.group_by("department").agg(
pl.col("salary").mean(),
pl.col("employee_id").count()
)
Expression Expansion
Apply operations to multiple columns at once:
All columns:
df.select(pl.all() * 2)
Pattern matching:
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)
# All numeric columns (using selectors; pl.NUMERIC_DTYPES is deprecated/removed in recent Polars versions)
import polars.selectors as cs
df.select(cs.numeric() + 1)
Exclude patterns:
df.select(pl.all().exclude("id", "name"))
Expression Composition
Expressions can be stored and reused:
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()
# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)
Data Types
Polars has a strict type system based on Apache Arrow.
Core Data Types
Numeric:
- Int8, Int16, Int32, Int64 - Signed integers
- UInt8, UInt16, UInt32, UInt64 - Unsigned integers
- Float32, Float64 - Floating point numbers
Text:
- Utf8 / String - UTF-8 encoded strings
- Categorical - Categorized strings (low cardinality)
- Enum - Fixed set of string values
Temporal:
- Date - Calendar date (no time)
- Datetime - Date and time with optional timezone
- Time - Time of day
- Duration - Time duration/difference
Boolean:
- Boolean - True/False values
Nested:
- List - Variable-length lists
- Array - Fixed-length arrays
- Struct - Nested record structures
Other:
- Binary - Binary data
- Object - Python objects (avoid in production)
- Null - Null type
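To see these types in practice, here is a small sketch that constructs a DataFrame with an explicit dtype override and inspects its schema (the column names and values are illustrative):
import polars as pl
from datetime import date
df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["a", "b", "c"],
        "signup": [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)],
        "scores": [[1.0, 2.0], [3.0], [4.0, 5.0]],
    },
    schema_overrides={"id": pl.UInt32},  # force UInt32 instead of the inferred Int64
)
print(df.schema)  # e.g. {'id': UInt32, 'name': String, 'signup': Date, 'scores': List(Float64)}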
Type Casting
Convert between types explicitly:
# Cast to different type
df.select(
pl.col("age").cast(pl.Float64),
pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
pl.col("id").cast(pl.Utf8)
)
Null Handling
Polars uses consistent null handling across all types:
Check for nulls:
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())
Fill nulls:
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")
Drop nulls:
df.drop_nulls() # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"]) # Drop rows with nulls in specific columns
Categorical Data
Use categorical types for string columns with low cardinality (repeated values):
# Cast to categorical
df.with_columns(
pl.col("category").cast(pl.Categorical)
)
# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information
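A rough way to check the memory effect, assuming a column with many repeated values (estimated_size() reports an approximate in-memory size):
import polars as pl
df = pl.DataFrame({"category": ["red", "green", "blue"] * 100_000})
print(df.estimated_size("mb"))  # plain String column
print(
    df.with_columns(pl.col("category").cast(pl.Categorical)).estimated_size("mb")
)  # typically much smaller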
Lazy vs Eager Evaluation
Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).
Eager Evaluation (DataFrame)
Operations execute immediately:
import polars as pl
# DataFrame operations execute right away
df = pl.read_csv("data.csv") # Reads file immediately
result = df.filter(pl.col("age") > 25) # Filters immediately
final = result.select("name", "age") # Selects immediately
When to use eager:
- Small datasets that fit in memory
- Interactive exploration in notebooks
- Simple one-off operations
- Immediate feedback is needed
Lazy Evaluation (LazyFrame)
Operations build a query plan that is optimized before execution:
import polars as pl
# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv") # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25) # Adds to plan
lf3 = lf2.select("name", "age") # Adds to plan
df = lf3.collect() # NOW executes optimized plan
When to use lazy:
- Large datasets
- Complex query pipelines
- Only a subset of the data is needed
- Performance is critical
- Streaming is required
Query Optimization
Polars automatically optimizes lazy queries:
Predicate Pushdown: Filter operations are pushed down to the data source when possible:
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()
Projection Pushdown: Only the columns that are needed are read from the data source:
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()
Query Plan Inspection:
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain()) # Shows optimized plan
Streaming Mode
Process data larger than memory:
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(streaming=True)
Streaming benefits:
- Process data larger than RAM
- Lower peak memory usage
- Chunk-based processing
- Automatic memory management
Streaming limitations:
- Not all operations support streaming
- May be slower for small data
- Some operations require materializing the entire dataset
Converting Between Eager and Lazy
Eager to Lazy:
df = pl.read_csv("data.csv")
lf = df.lazy() # Convert to LazyFrame
Lazy to Eager:
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute and return DataFrame
Memory Format
Polars uses the Apache Arrow columnar memory format:
Benefits:
- Zero-copy data sharing with other Arrow libraries
- Efficient columnar operations
- SIMD vectorization
- Reduced memory overhead
- Fast serialization
Implications:
- Data is stored column-wise, not row-wise
- Column operations are very fast
- Random row access is slower than in pandas
- Best suited for analytical workloads
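Because the layout is Arrow, handing data to and from other Arrow-based libraries is cheap. A minimal sketch (requires pyarrow to be installed):
import polars as pl
df = pl.DataFrame({"name": ["a", "b"], "age": [25, 30]})
# Export to a pyarrow Table (largely zero-copy for compatible types)
arrow_table = df.to_arrow()
# Build a Polars DataFrame back from the Arrow table
df_roundtrip = pl.from_arrow(arrow_table)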
Parallelization
Polars parallelizes operations automatically using Rust's concurrency:
What gets parallelized:
- Aggregations within groups
- Window functions
- Most expression evaluations
- File reading (multiple files)
- Join operations
What to avoid if you want parallelization:
- Python user-defined functions (UDFs)
- Lambda functions in .map_elements()
- Sequential .pipe() chains
Best practice:
# Good: Stays in expression API (parallelized)
df.with_columns(
pl.col("value") * 10,
pl.col("value").log(),
pl.col("value").sqrt()
)
# Bad: Uses Python function (sequential)
df.with_columns(
pl.col("value").map_elements(lambda x: x * 10)
)
Strict Type System
Polars enforces strict typing:
No silent conversions:
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")
# Must cast explicitly
df.with_columns(
pl.col("int_col").cast(pl.Utf8) + "_suffix"
)
Benefits:
- Prevents silent bugs
- Predictable behavior
- Better performance
- Clearer code intent
Integer nulls: Unlike pandas, integer columns can have nulls without converting to float:
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)