A few weeks ago we announced support for running Apache Spark on top of YugaByte using the Apache Cassandra API.
Today, we are excited to announce support for query locality with Apache Spark using our YugaByte fork of the Spark Cassandra Connector. You can find the deployed packages here.
Our fork ensures that the Spark partitions are based on the internal sharding of YugaByte tables and that Spark queries are efficiently routed to the right YugaByte node (read more about YugaByte data sharding in our docs).
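To see what this buys you, here is a minimal Scala sketch. It assumes a YugaByte node running locally and a hypothetical keyspace test with a table users; the fork keeps the upstream com.datastax.spark.connector package names, so existing code only needs the dependency swap:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    // Point the connector at any YugaByte node; it discovers the rest of the cluster.
    val conf = new SparkConf()
      .setAppName("yb-locality-check")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // With the YugaByte fork, Spark partitions follow the internal
    // sharding of the YugaByte table rather than Cassandra token ranges.
    val rdd = sc.cassandraTable("test", "users")
    println(s"Spark partitions: ${rdd.partitions.length}")

    // Each partition's preferred location is the node that owns its shard,
    // so tasks are scheduled next to the data they read.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}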
To give it a try, update the package configuration for your existing YugaByte-based application:
Java/Maven:
Add the following snippet to your pom.xml:
<dependency>
  <groupId>com.yugabyte.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>2.0.5-yb-1</version>
</dependency>
Scala/sbt:
Add the following library dependency to your project configuration:
libraryDependencies += "com.yugabyte.spark" %% "spark-cassandra-connector" % "2.0.5-yb-1"
Python:
Start PySpark with:
$ pyspark --packages com.yugabyte.spark:spark-cassandra-connector_2.10:2.0.5-yb-1
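Whichever language you use, reads and writes go through the same API as the upstream connector. As a sketch in Scala, from a spark-shell started with the same --packages flag (the keyspace test, table users, and its id and name columns are hypothetical; sc is the shell's SparkContext):

import com.datastax.spark.connector._

// Write a few sample rows through the connector.
sc.parallelize(Seq((1, "jane"), (2, "john")))
  .saveToCassandra("test", "users", SomeColumns("id", "name"))

// Read them back; the fork splits the scan along YugaByte's shards
// and routes each piece to the node that owns it.
sc.cassandraTable("test", "users")
  .collect()
  .foreach(println)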
If you don’t have an existing app, you can get started quickly by installing YugaByte and trying out our Spark sample apps.