Iskra is a Scala 3 wrapper library around Apache Spark API which allows writing typesafe and boilerplate-free but still efficient Spark code.
Starting from the release of 3.2.0, Spark is cross-compiled also for Scala 2.13, which opens a way to using Spark from Scala 3 code, as Scala 3 projects can depend on Scala 2.13 artifacts.
However, one might run into problems when trying to call a method requiring an implicit instance of Spark's Encoder
type. Derivation of instances of Encoder
relies on presence of a TypeTag
for a given type. However TypeTag
s are not generated by Scala 3 compiler anymore (and there are no plans to support this) so instances of Encoder
cannot be automatically synthesized in most cases.
Iskra tries to work around this problem by using its own encoders (unrelated to Spark's Encoder
type) generated using Scala 3's new metaprogramming API.
Iskra provides thin (but strongly typed) wrappers around DataFrame
s, which track types and names of columns at compile time but let Catalyst perform all of its optimizations at runtime.
Iskra uses structural types rather than case classes as data models, which gives us a lot of flexibility (no need to explicitly define a new case class when a column is added/removed/renamed!) but we still get compilation errors when we try to refer to a column which doesn't exist or can't be used in a given context.
//> using lib "org.virtuslab::iskra:0.0.3"
scala-cli repl --dep org.virtuslab::iskra:0.0.3
build.sbt
in an sbt project:libraryDependencies += "org.virtuslab" %% "iskra" % "0.0.3"
Iskra is built with Scala 3.1.3 so it's compatible with Scala 3.1.x and newer minor releases (starting from 3.2.0 you'll get code completions for names of columns in REPL and Metals!). Iskra transitively depends on Spark 3.2.0.
import org.virtuslab.iskra.api.*
given spark: SparkSession =
SparkSession
.builder()
.master("local")
.appName("my-spark-app")
.getOrCreate()
toTypedDF
extension method on a Seq
of case classes, e.g.Seq(Foo(1, "abc"), Foo(2, "xyz")).toTypedDF
typed
extension method on it with a type parameter representing a case class, e.g.df.typed[Foo]
In case you needed to get back to the unsafe world of untyped data frames for some reason, just call .untyped
on a typed data frame.
This library intends to maximally resemble the original API of Spark (e.g. by using the same names of methods, etc.) where possible, although trying to make the code feel more like regular Scala without unnecessary boilerplate and adding some other syntactic improvements.
Most important differences:
$.foo.bar
instead of $"foo.bar"
or col("foo.bar")
. Use backticks when necessary, e.g. $.`column with spaces`
..select(...)
or .select{...}
you should return something that is a named column or a tuple of named columns. Because of how Scala syntax works you can write simply .select($.x, $.y)
instead of select(($.x, $.y))
. With braces you can compute intermediate values like.select {
val sum = ($.x + $.y).as("sum")
($.x, $.y, sum)
}
foos.innerJoin(bars).on($.foos.barId === $.bars.id).select(...)
foos
and bars
were automatically inferredThis project is built using scala-cli so just use the traditional commands with .
as root like scala-cli compile .
or scala-cli test .
.
For a more recent version of Usage
section look here