A few weeks ago we announced support for running Apache Spark on top of YugaByte using the Apache Cassandra API.
Our fork ensures that Spark partitions map to the internal sharding of YugaByte tables, and that Spark queries are routed efficiently to the YugaByte node holding the relevant data (read more about YugaByte data sharding in our docs).
To give it a try, update the package configuration for your existing YugaByte-based application:
For Maven projects, add the following snippet to your pom.xml:
<dependency>
  <groupId>com.yugabyte.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>2.0.5-yb-1</version>
</dependency>
For sbt projects, add the following library dependency to your build configuration (build.sbt):
libraryDependencies += "com.yugabyte.spark" %% "spark-cassandra-connector" % "2.0.5-yb-1"
Start PySpark with:
$ pyspark --packages com.yugabyte.spark:spark-cassandra-connector_2.10:2.0.5-yb-1
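With PySpark running, you can read a YugaByte table through the connector's Cassandra data source. A minimal sketch, assuming a YugaByte node reachable on 127.0.0.1 and a hypothetical keyspace and table (test_keyspace.users) that you would replace with your own:

```python
from pyspark.sql import SparkSession

# Point the connector at a YugaByte node's Cassandra-compatible endpoint.
# Host and keyspace/table names below are placeholders for illustration.
spark = (SparkSession.builder
         .appName("yb-spark-example")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Load the table as a DataFrame; partitions follow YugaByte's sharding.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="test_keyspace", table="users")
      .load())

df.show()
```

Because the fork aligns Spark partitions with YugaByte's tablets, scans over this DataFrame are pushed to the nodes that own the data rather than funneled through a single coordinator.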