At the onset of the big data age, there was an argument for doing schema-on-read, instead of schema-on-write. This encouraged practitioners to just dump their data in a data lake without having to worry about its schema. That was a blessing for many. However, it turns out that all data has metadata, we just got lazier and stopped labelling our data which made accessing it more difficult. Who knows what this data set contains? What schema does it follow? What version of the schema does it correspond to, in case it has evolved over time? And, now we seem to be retracting back to make sure that we add metadata to our data sets, metadata that is flexible enough to allow for multiple views of same information and evolvable enough to allow for changes in the future. Today's blog post is about one of the centerpieces of metadata in the big data ecosystem - the Hive metastore.
Apache Hive was, of course, the first SQL-on-Hadoop engine and with it came a small service that talked Thrift, called the Hive metastore service (or Hive metastore server). This service was backed by a relational database of the likes of PostgreSQL, MySQL, etc. and it houses metadata about big data. This includes information like what is the table name associated with a data set, its columns and types (so you can query them), where the data set is stored (on HDFS, S3, HBase, Cassandra, etc.), how it's stored, where applicable (using Parquet on HDFS, for example) and location of the data set, if applicable (/user/hive/warehouse/my_data_set on HDFS). It was then expanded to include further metadata to speed up queries. This started off with partitioning info (sub-directories in your data set, most of which could be filtered out if you chose the right where clause), bucketing and sorting info i.e how the data set was "bucketed" (in other words, hash partitioned) in files and how was the data sorted in those files. Bucketing and sorting metadata could cut down time considerably on certain SQL operations. For example, if you were joining two large tables, joining them had they been bucketed and sorted would be much faster (of the order of square of number of buckets) because the join can now be on done on individual buckets (assuming they have the same number of buckets) instead of the entire tables. Then, the Hive metastore was further expanded to store statistics about the data sets - how many records, histograms and distribution of particular keys in the data set so they could be used to optimize query planning.
But, now new SQL-on-Hadoop engines are popping up every day and Hive usage is being replaced by the likes of Spark SQL, Presto and others. However, the Hive metastore still remains the de-facto repository of metadata in the big data ecosystem. All SQL engines - Spark SQL, Presto, Impala, etc. integrate with Hive metastore - they are able to use it as their catalog of metadata and optionally statistics so disparate users with different software systems can have a single consistent view of metadata.
As you bring in big data practices in your organization, I encourage to think of what metadata you can store to speed up your queries or in some cases, get rid of queries (for example, hitting the metastore for the query, instead of actual data). Paramount to this is to configure your tools, to ensure that the metadata is up-to-date and being used to drive queries. And, as you do this, don't forget to leverage the de-facto metadata repository - Hive metastore.