Apache Spark has had support for running SQL queries on Spark for a while. In particular, you can share the metadata (names of tables, column names and types, storage format, etc.) with the Hive metastore. The latter has become the de-facto metadata repository in the Hadoop ecosystem.
In this brief post, I will share with you some Spark SQL gotchas (as of Apache Spark 1.5.1):
1. Don't run multiple SQL statements separated by semicolon.
Let's say you want to create a table named my_table but delete it if it exists, before you create it. You may want to write something like this on the spark-shell:
spark-shell> // Spark Shell creates a HiveContext for you, to launch SQL queries. // This context can be used by referring to the sqlContext variable. spark-shell> // Don't do this spark-shell> sqlContext.sql("DROP TABLE IF EXISTS my_table; CREATE TABLE my_table (col INT)")
In most SQL systems, you can separate 2 SQL statements with a semicolon and expect them to run one at at time, one after the other. However, in Spark SQL, they are treated just as one statement so the above doesn't work. So, you have to run each of the statements separately.
spark-shell> // Do this spark-shell> sqlContext.sql("DROP TABLE IF EXISTS my_table) spark-shell> sqlContext.sql("CREATE TABLE my_table (col INT)")
2. If you are writing your own Spark app (i.e. not using the spark-shell), be sure to create a HiveContext instead of a SQLContext object for running SQL queries.
val conf = new SparkConf().setAppName(this.getClass.toString) val sc = new SparkContext // Don't do this val sqlContext = new SQLContext(sc)
If you use SQLContext, you may get an error like:
Exception in thread "main" java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found
This is because you are using SQLContext, instead of HiveContext. SQLContext implements a very small subset of the SQL dialect. Instead, it's recommended to write the query in HiveQL (which is exactly the same as "SQL" in most practical cases) and use HiveContext instead. HiveContext forwards most of the queries to Hive for parsing, thereby piggybacking off of existing parsing functionality in Hive, which is much more encompassing than the one implemented in SQLContext.
val conf = new SparkConf().setAppName(this.getClass.toString) val sc = new SparkContext // Do this val sqlContext = new HiveContext(sc)
Hope that was helpful!