Introduction
This article describes how to write a UDF in Apache Hive and provides some tips, tricks and insights related to writing UDFs.

What is a UDF?
Apache Hive comes with a set of pre-defined User Defined Functions (aka UDFs). A complete listing of Hive's built-in UDFs is available in the Hive language manual. These UDFs are the equivalents of functions like hex(), unhex(), from_unixtime() and unix_timestamp() in MySQL. The Hive community has made an effort to make the most commonly used UDFs available to users. Often, however, the existing UDFs don't cover a particular need, and users want to write their own custom UDF. This article walks through that process.

UDF vs. Generic UDF
Hive UDFs are written in Java. In order to create a Hive UDF, you need to derive from one of two classes: UDF or GenericUDF. Here is a small comparison identifying the differences between writing a UDF that derives from the UDF class vs. the GenericUDF class:

  • Development effort: the UDF class is easier to develop against; GenericUDF is a little more difficult.
  • Performance: UDF is slower due to its use of reflection; GenericUDF performs better thanks to lazy evaluation and short-circuiting.
  • Parameter types: UDF doesn't accept some non-primitive parameters, such as struct; GenericUDF supports all non-primitive parameters as input parameters and return types.
(Thanks to Steve Waggoner for suggesting 2 corrections to the above table.)

So, why am I telling you all this?
Deriving your UDF from the UDF class will let you develop faster, but the code will be less scalable and arguably perform worse. GenericUDF has some learning curve to it, but it will allow your UDF to be more scalable. Moreover, this article aims to reduce the steepness of that learning curve, making GenericUDF nearly as easy to use as the UDF class.

My recommendation: Read along and use GenericUDF.

Sample Code
The code we are going to use as a reference for learning how to write a UDF is the code for a UDF I created, translate(). The code is available in Apache Hive's trunk.

We won't go into the nitty-gritty of what this code does; instead, we will learn the general semantics of writing a UDF using GenericUDF. You need to override 3 methods: initialize(), evaluate() and getDisplayString().

Annotations
In general, you should annotate your UDF with the following annotations (replace the values of various parameters based on your use case):
  • @UDFType(deterministic = true)
A deterministic UDF is one which always gives the same result when passed the same parameters. Examples of such UDFs are length(string input), regexp_replace(string initial_string, string pattern, string replacement), etc. A non-deterministic UDF, on the other hand, can return different results for the same set of parameters. For example, unix_timestamp() returns the current timestamp using the default time zone. Therefore, when unix_timestamp() is invoked with the same parameters (no parameters) at different times, different results are obtained, making it non-deterministic. This annotation allows Hive to perform some optimizations if the UDF is deterministic.
  • @Description(name = "my_udf", value = "This will be the result returned by the DESCRIBE FUNCTION statement.", extended = "This will be the result returned by the DESCRIBE FUNCTION EXTENDED statement.")
This annotation tells Hive the name of your UDF. It is also used to populate the result of queries like `DESCRIBE FUNCTION MY_UDF` or `DESCRIBE FUNCTION EXTENDED MY_UDF`.
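The determinism distinction above can be sketched in plain Java (the function names here are made up for illustration; only the determinism property matters):

```java
public class DeterminismDemo {
    // Deterministic: the same input always produces the same output, so Hive
    // may safely cache or pre-compute the result when the annotation says so.
    static int stringLength(String s) {
        return s.length();
    }

    // Non-deterministic: the result depends on when the call happens,
    // like unix_timestamp() invoked with no arguments.
    static long currentSeconds() {
        return System.currentTimeMillis() / 1000L;
    }
}
```

Marking a function like currentSeconds() as deterministic would let Hive fold it into a constant at compile time and silently return stale values, which is why the annotation must be set honestly.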

initialize()
This method is called only once, before any rows are processed, to initialize the UDF. initialize() is used to assert and validate the number and type of parameters that the UDF takes, and the type it returns. For example, the translate UDF takes in three string parameters and returns a string. As you will see in the code, initialize() does the following:
  • Asserts that the expected number of parameters are received by the UDF
  • In the example, we expect all parameters to be of String type. Therefore, this method iterates through the parameters and ensures that each is of a primitive category (the non-primitive data types in Hive are Arrays, Maps and Structs). Once it is asserted that primitives were sent as arguments, the method asserts that the primitive category of each parameter is String or Void. We need to check for the Void category because NULLs passed to the UDF are treated as primitive data types with a primitive category of VOID. If you are expecting an argument of a non-primitive type, you may have to add checks asserting the types that the data type encapsulates (e.g. ensuring that an array being sent as a parameter is an array of strings).
  • Transforms the ObjectInspector array received (one element per argument) into a Converter array. These converters are stored in fields and used in evaluate() to retrieve the value of each parameter (more on that later).
  • Returns an ObjectInspector corresponding to the return type of the UDF. For returning a String output, we would return PrimitiveObjectInspectorFactory.writableStringObjectInspector.
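The category check in the second bullet can be sketched like this in plain Java. The enum below is a cut-down stand-in for Hive's PrimitiveObjectInspector.PrimitiveCategory, and the real UDF throws a UDFArgumentTypeException rather than an IllegalArgumentException:

```java
import java.util.List;

public class ArgumentCheckDemo {
    // Cut-down stand-in for Hive's PrimitiveCategory enum.
    enum PrimitiveCategory { STRING, VOID, INT, DOUBLE }

    // Every argument must be a STRING, or VOID (a NULL literal arrives as a
    // primitive whose category is VOID); anything else is rejected.
    static void validateArguments(List<PrimitiveCategory> categories) {
        for (int i = 0; i < categories.size(); i++) {
            PrimitiveCategory c = categories.get(i);
            if (c != PrimitiveCategory.STRING && c != PrimitiveCategory.VOID) {
                throw new IllegalArgumentException(
                    "Argument " + (i + 1) + " must be a string, got " + c);
            }
        }
    }
}
```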
evaluate()
This method is called once for every row of data being processed. It uses the Converter array populated in initialize() to retrieve the values of the parameters for the row in question. You then add the logic that computes the return value from the parameter values, and return that value. In the case of the translate UDF, a string is returned.
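Setting aside the ObjectInspector/Converter plumbing, the per-row logic of translate(input, from, to) boils down to something like the following plain-Java sketch (simplified to char-level mapping; the actual UDF also handles multi-byte characters and NULL inputs):

```java
import java.util.HashMap;
import java.util.Map;

public class TranslateDemo {
    // Replace each char of `input` found in `from` with the char at the same
    // position in `to`; chars in `from` with no counterpart in `to` are dropped.
    static String translate(String input, String from, String to) {
        Map<Character, Character> mapping = new HashMap<>();
        for (int i = 0; i < from.length(); i++) {
            char f = from.charAt(i);
            if (!mapping.containsKey(f)) {
                // First occurrence wins; null marks "delete this char".
                mapping.put(f, i < to.length() ? to.charAt(i) : null);
            }
        }
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (!mapping.containsKey(c)) {
                out.append(c);              // not in `from`: copy through
            } else if (mapping.get(c) != null) {
                out.append(mapping.get(c)); // mapped: substitute
            }                               // mapped to null: drop
        }
        return out.toString();
    }
}
```

For example, translate("abcdef", "adc", "19") yields "1b9ef": a becomes 1, d becomes 9, and c (which has no counterpart in "19") is removed.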

getDisplayString()
A simple method that returns the display string for the UDF.
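In GenericUDF, getDisplayString(String[] children) receives the display strings of the child expressions, and a typical implementation simply joins them into a readable call, roughly:

```java
public class DisplayStringDemo {
    // Build something like: translate(col1, 'ab', 'cd')
    static String displayString(String udfName, String[] children) {
        return udfName + "(" + String.join(", ", children) + ")";
    }
}
```

This string is what shows up in EXPLAIN output, so keeping it close to the original query syntax helps debugging.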

Having implemented the above 3 methods, you are all set to build and test out your UDF.

Building the UDF
In order to build the UDF, you will need hive-exec*.jar on your Java build path. For example, if you are using Maven, you will need a snippet like the following in your pom.xml:

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>0.9.0</version>
    </dependency>


You may change the version number to match the version of Hive you are compiling against; newer versions are generally backwards compatible with UDFs built against older ones.

Deploying and testing the UDF
Once you have built a JAR containing your UDF, you will have to run the following commands in Hive to make your UDF available for use:
hive> add jar my_udf.jar;
hive> create temporary function my_udf as 'com.me.udf.GenericMyUdf';
hive> select my_udf(col1, col2, col3) from my_table limit 5;

This page is a work in progress. More to come soon. 

 


Comments

Nishant Kelkar
09/09/2013 12:25pm

"if (primitiveCategory != PrimitiveCategory.STRING
&& primitiveCategory != PrimitiveCategory.VOID)"

In the above condition which exists in your code, shouldn't you be checking with an OR condition? Like this, maybe?

"if (primitiveCategory != PrimitiveCategory.STRING
|| primitiveCategory != PrimitiveCategory.VOID)"

Thanks, let me know if I'm wrong and if so, why. Appreciate it! :)

09/09/2013 1:39pm

Nishant, I still think it should be AND.

The intent is to throw an exception if the type is not one of the types that is expected - the only expected types being string and void.

Now, if there were an OR instead of AND, an exception would be thrown even if string or void type is passed which is not the expected outcome.

The other way to see it is to use De Morgan's laws (http://en.wikipedia.org/wiki/De_Morgan's_laws): we don't want to throw an exception if the type is string OR void. Therefore, we want to throw an exception when the type is not string AND not void.
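The two conditions can be checked directly in plain Java (an illustrative snippet, not taken from the UDF itself):

```java
public class DeMorganDemo {
    enum Cat { STRING, VOID, INT }

    // The AND version: true (i.e. "reject") only for genuinely unexpected types.
    static boolean rejectWithAnd(Cat c) {
        return c != Cat.STRING && c != Cat.VOID;
    }

    // The OR version: true for every value, since no value can equal both
    // STRING and VOID at once -- it would reject valid arguments too.
    static boolean rejectWithOr(Cat c) {
        return c != Cat.STRING || c != Cat.VOID;
    }
}
```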

Steve Waggoner
03/01/2014 9:38pm

Few comments:

I was able to pass an array of strings to a UDF. If your Java parameter is ArrayList<String>, then it seems to convert a Hive array of strings (e.g. the return of the SPLIT() function) correctly. You mentioned in your table that it's not possible, but it appears to work fine.

I am not sure slower necessarily means less scalable. Hive, compared with MySQL, can be many times slower, but it scales with more machines. I am sure UDFs are less efficient, and so somewhat slower, than GenericUDFs. But the readability of the code at the GenericUDF level leaves a lot to be desired.

Another thing that is nice about UDF over GenericUDF is that you can overload the evaluate() function to support a different number of parameters, and even different types of parameters. I am sure you can do that with GenericUDF, but it has to be less elegant. So, you're partly wrong that you cannot support a different number of parameters in UDF, too.

09/01/2014 10:26am

Thanks for your comment. I will correct the post to reflect your suggestions and corrections.

To me, if the number of people using the UDF is going to be larger than the number of people reading my code (which is likely to be the case for most of us), I'd choose to make my GenericUDF more performant, especially since I can make my code more readable with comments and good coding conventions. I do, however, understand that everyone's situation is different, so I will let the readers make the call on which type of UDF they want to build.

Thanks again for thoroughly reading and posting. Appreciate the corrections!




    Author

    Mark Grover is a Canadian computer engineer, runner and dancer.
