Introduction

This article explains how to write a UDF in Apache Hive and shares some tips, tricks and insights about writing UDFs.

What is a UDF?

Apache Hive comes with a set of pre-defined User Defined Functions (UDFs) available for use. A complete listing of Hive UDFs is available here. These UDFs are the equivalent of functions like hex(), unhex(), from_unixtime() and unix_timestamp() in MySQL. The Hive community has made an effort to make most of the commonly used UDFs available to users. Often, however, the existing UDFs do not cover what users need, and they want to write their own custom UDF. This article goes through that process.

UDF vs. Generic UDF

Hive UDFs are written in Java. To create a Hive UDF, you need to derive from one of two classes: UDF or GenericUDF. Here is a small comparison table identifying the differences between writing a UDF that derives from the UDF class vs. the GenericUDF class:
(Thanks to Steve Waggoner for suggesting 2 corrections to the above table.)
So, why am I telling you all this? Deriving your UDF from the UDF class lets you develop faster, but the code will be less scalable and arguably less performant. GenericUDF has a learning curve, but it allows your UDF to be more scalable. Moreover, this article aims to flatten that learning curve so that using GenericUDF is as easy as using the UDF class. My recommendation: read along and use GenericUDF.

Sample Code

The code we are going to use as a reference for learning how to write a UDF is a UDF that I wrote, translate(). The code is available in Apache Hive's trunk. We won't go into the nitty-gritty of what this code does, but will learn the general semantics of writing a UDF using GenericUDF. You need to override 3 methods: initialize(), evaluate() and getDisplayString().

Annotations

In general, you should annotate your UDF with the following annotations (replace the values of the various parameters based on your use case):
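As a rough illustration of the annotation step, here is a minimal sketch. It assumes the annotation in question is Hive's @Description (org.apache.hadoop.hive.ql.exec.Description) with its name, value and extended attributes. To keep the sketch compilable without Hive on the classpath, it declares a local stand-in annotation of the same shape; the class name GenericMyUdf, the function name my_udf and the help strings are placeholders, and helpText() is a hypothetical helper that only mimics how the _FUNC_ placeholder is replaced by the function name when help text is displayed.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in so this sketch compiles on its own. In a real UDF, delete
// this declaration and import org.apache.hadoop.hive.ql.exec.Description.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Description {
    String name();
    String value();
    String extended() default "";
}

// Placeholder class; a real implementation would extend GenericUDF.
@Description(
    name = "my_udf",
    value = "_FUNC_(input, from, to) - short one-line summary of the function",
    extended = "Example:\n  > SELECT _FUNC_('abcdef', 'adc', '19') FROM src LIMIT 1;")
final class GenericMyUdf {

    // Hypothetical helper: reads the annotation reflectively and fills in
    // the _FUNC_ placeholder with the function name.
    static String helpText() {
        Description d = GenericMyUdf.class.getAnnotation(Description.class);
        return d.value().replace("_FUNC_", d.name());
    }
}
```

Note that extended is passed as a key-value pair (extended = "..."), just like name and value.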
initialize()

This method is called only once per JVM, at the beginning, to initialize the UDF. initialize() is used to assert and validate the number and types of the parameters that a UDF takes and the type of the value it returns. For example, the translate UDF takes 3 string parameters and returns a string. As you will see in the code, initialize() does the following:
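The validation described for initialize() can be sketched in plain Java. This is not the Hive API: the real method receives an ObjectInspector[] and returns an ObjectInspector, and it reports problems by throwing a UDFArgumentException. The types Category, PrimitiveKind, ArgInfo and TranslateArgChecker below are hypothetical stand-ins, used only to show the shape of the checks for a function that takes 3 strings and returns a string. Setting up the Converter array is left out.

```java
import java.util.List;

// Stand-ins for Hive's type metadata; NOT the real ObjectInspector API.
enum Category { PRIMITIVE, LIST, MAP, STRUCT }
enum PrimitiveKind { STRING, INT, DOUBLE }

final class ArgInfo {
    final Category category;
    final PrimitiveKind kind; // null when the argument is not a primitive

    ArgInfo(Category category, PrimitiveKind kind) {
        this.category = category;
        this.kind = kind;
    }
}

final class TranslateArgChecker {

    // Mirrors the role of initialize(): validate the arguments once,
    // then report the type the function will return.
    static PrimitiveKind check(List<ArgInfo> args) {
        // 1. Validate the number of arguments.
        if (args.size() != 3) {
            throw new IllegalArgumentException(
                "translate() takes exactly 3 arguments, got " + args.size());
        }
        // 2. Validate the type of each argument.
        for (int i = 0; i < args.size(); i++) {
            ArgInfo arg = args.get(i);
            if (arg.category != Category.PRIMITIVE
                    || arg.kind != PrimitiveKind.STRING) {
                throw new IllegalArgumentException(
                    "Argument " + (i + 1) + " of translate() must be a string");
            }
        }
        // 3. Declare the return type.
        return PrimitiveKind.STRING;
    }
}
```

Doing these checks once up front means the per-row code can assume well-typed arguments instead of re-validating them for every row.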
evaluate()

This method is called once for every row of data being processed. It uses the Converter array populated in initialize() to retrieve the values of the parameters for the row in question. You then put in the logic to compute the return value from those parameter values and return it. In the case of the translate UDF, a string is returned.

getDisplayString()

A simple method that returns the display string for the UDF.

Having implemented the above 3 methods, you are all set to build and test your UDF.

Building the UDF

To build the UDF, you will need to have hive-exec*.jar in your Java build path. For example, if you are using Maven, you will need a snippet like the following in your pom.xml:

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>0.9.0</version>
    </dependency>

You may change the version number to the version of Hive you are compiling against, although higher versions are backwards compatible.

Deploying and testing the UDF

Once you have built a JAR containing your UDF, you will have to run the following commands in Hive to make your UDF available for use:

    hive> add jar my_udf.jar;
    hive> create temporary function my_udf as 'com.me.udf.GenericMyUdf';
    hive> select my_udf(col1, col2, col3) from my_table limit 5;
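To make the evaluate() step more concrete, here is a plain-Java sketch of the kind of per-row logic a translate-style function computes once its three string arguments have been extracted. It is not the code from Hive's trunk. It assumes PostgreSQL-style translate semantics: each character of from is replaced by the character at the same position in to, characters of from with no counterpart in to are deleted, and a null argument yields a null result. The class name TranslateSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TranslateSketch {

    // Computes the result for one row from already-converted arguments.
    static String translate(String input, String from, String to) {
        // Assumed behaviour: any null argument gives a null result.
        if (input == null || from == null || to == null) {
            return null;
        }

        Map<Integer, Integer> replace = new HashMap<>();
        Set<Integer> delete = new HashSet<>();
        int[] fromCps = from.codePoints().toArray();
        int[] toCps = to.codePoints().toArray();

        for (int i = 0; i < fromCps.length; i++) {
            int cp = fromCps[i];
            // The first occurrence of a character in 'from' wins.
            if (replace.containsKey(cp) || delete.contains(cp)) {
                continue;
            }
            if (i < toCps.length) {
                replace.put(cp, toCps[i]); // has a counterpart: replace
            } else {
                delete.add(cp);            // no counterpart: delete
            }
        }

        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (delete.contains(cp)) {
                return; // drop this character
            }
            out.appendCodePoint(replace.getOrDefault(cp, cp));
        });
        return out.toString();
    }
}
```

Under these assumptions, TranslateSketch.translate("abcdef", "adc", "19") returns "1b9ef": a becomes 1, d becomes 9, and c is dropped because it has no counterpart in the to string. In a real GenericUDF, the lookup tables would typically be cached between rows rather than rebuilt on every call.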
10 Comments
Nishant Kelkar
9/9/2013 02:25:20 am
"if (primitiveCategory != PrimitiveCategory.STRING
9/9/2013 03:39:08 am
Nishant, I still think it should be AND.
Steve Waggoner
1/3/2014 10:38:44 am
Few comments:
1/8/2014 11:26:46 pm
Thanks for your comment. I will correct the post to reflect your suggestions and corrections.
12/9/2014 07:16:44 am
There is a type-o in the annotation for the udf Description. the extended description should be of the form <key>, <value> like this:
codetexas
6/11/2015 12:56:45 pm
I still dont understand generic udf. This blog post is not complete. There is nothing on the internet that simply explains what the generic udf is. poor.
Uday
12/15/2015 11:39:36 pm
Hey Mark,
8/22/2016 10:42:42 pm
good article for learn hive UDF,
mohan
8/29/2016 04:49:08 am
Nice blog for hive udf thankyou...
Mark's Blog: Technical writings by Mark Grover