The Apache ORC file format has been used heavily by Apache Hive for many years now, but since it is a binary file format, there just isn’t much we can do with basic tools to see what’s inside one of these files, as shown below.
$ cat orcfile
ORC
P1
???>P>??be!Q%~.ע?d!?????T ?;
DoeSmith(P4??be!%..&wG!?? ?
??'LesterEricJohnSusie FdEBR F6PDoeMartinSmithGATXOKMA???
??]?M?Ku??????9?sT?#?ްͲ㖆O:^xh?>??FWe?Pve??桿F?Ӳ?LuS????b?` `??`???/p?_?]C?8???kQf?kpiqf??PB?K
(???쒟
X?X8X?9X?89.? Ź?????B"$?b4?`X?$???,??(???????#?????"Ŝ??"Ś????*Ś??KKR??8????
????b??a%???????Z??,?\Z??*????J?q1???s3K2$4??rVB@q..&wG!?? ???????"
(^0??ORC
Fortunately, the ORC project has a couple of options for CLI tools. For this posting, I settled on the Java Tools. Now, you could be a good citizen and build these yourself from source, but I (the lazy programmer that I am) decided to just download a compiled “uber jar” file.
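For the good citizens out there, a build from source looks roughly like the following. This is a minimal sketch, assuming the apache/orc GitHub repo still keeps its Java code under the java directory and builds with Maven; the uber jar should end up somewhere under the tools module’s target directory.
$ git clone https://github.com/apache/orc.git
$ cd orc/java
$ mvn package -DskipTests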
First, I needed to figure out which version of ORC I was using. I’m currently running HDP 3.1.0, so I took a peek into the Hive lib folder.
$ ls /usr/hdp/current/hive-client/lib/orc*
/usr/hdp/current/hive-client/lib/orc-core-1.5.1.3.1.0.0-78.jar
/usr/hdp/current/hive-client/lib/orc-shims-1.5.1.3.1.0.0-78.jar
The HDP jar file naming convention let me know I was using ORC 1.5.1, so I surfed over to http://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/ and then pulled down the appropriate file.
wget https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/orc-tools-1.5.1-uber.jar
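Maven Central publishes a .sha1 checksum file next to every artifact, so it’s easy to sanity-check that the download arrived intact; the hash printed by sha1sum should match the contents of the .sha1 file.
$ wget https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/orc-tools-1.5.1-uber.jar.sha1
$ sha1sum orc-tools-1.5.1-uber.jar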
Now, I’m ready to use the tools, but… I realized I didn’t have an ORC file to test them out with, so I decided to use Apache Pig to build a small one. I first created a simple CSV file with vi and then pushed it to HDFS. The contents of the file are as follows.
$ hdfs dfs -cat pig/customers.csv
1001,Lester,Martin,GA
1002,Eric,Martin,TX
1003,John,Doe,OK
1004,Susie,Smith,MA
I then wrote a little read-and-write conversion script and executed it.
$ cat createORC.pig
custs = LOAD 'pig/customers.csv' USING PigStorage(',')
AS (cid:int, fname:chararray, lname:chararray, state:chararray);
STORE custs INTO 'orcfile' USING OrcStorage;
$ pig createORC.pig
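By the way, if you don’t have a cluster handy, Pig can also run a script like this in local mode, reading and writing the local filesystem instead of HDFS (you’d just point the LOAD and STORE paths at local files).
$ pig -x local createORC.pig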
As expected, it created a simple little ORC file, which I then pulled down to my Linux home directory.
$ hdfs dfs -ls orcfile
Found 2 items
-rw-r--r-- 3 zeppelin hdfs 0 2019-12-12 08:35 orcfile/_SUCCESS
-rw-r--r-- 3 zeppelin hdfs 569 2019-12-12 08:35 orcfile/part-v000-o000-r-00000
$ hdfs dfs -get orcfile/part-v000-o000-r-00000 orcfile
$ ls -l orcf*
-rw-r--r--. 1 zeppelin hadoop 569 Dec 12 08:36 orcfile
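As a quick aside, there’s an easy way to confirm this really is an ORC file even without the tools: the format begins with the 3-byte magic string ORC (and the magic shows up again near the end of the file), which matches what we glimpsed in that cat output earlier.
$ head -c 3 orcfile
ORC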
NOW, we can finally try out the ORC Tools jar. First up, we can look at the metadata of this file.
$ java -jar orc-tools-1.5.1-uber.jar meta orcfile
Processing data file orcfile [length: 569]
Structure for orcfile
File Version: 0.12 with ORC_135
Rows: 4
Compression: ZLIB
Compression size: 262144
Type: struct<cid:int,fname:string,lname:string,state:string>
**** REMOVED CONTENT FROM THESE SECTIONS FOR BREVITY ****
Stripe Statistics:
File Statistics:
Stripes:
File length: 569 bytes
Padding length: 0 bytes
Padding ratio: 0%
That had some interesting info (and I definitely trimmed a bunch of it for brevity), but what this post has really been building toward is seeing the contents of the file, so we just switch the subcommand from meta to data.
$ java -jar orc-tools-1.5.1-uber.jar data orcfile
Processing data file orcfile [length: 569]
{"cid":1001,"fname":"Lester","lname":"Martin","state":"GA"}
{"cid":1002,"fname":"Eric","lname":"Martin","state":"TX"}
{"cid":1003,"fname":"John","lname":"Doe","state":"OK"}
{"cid":1004,"fname":"Susie","lname":"Smith","state":"MA"}
Perfect! We can see the four rows represented as JSON documents, which is so much easier to read than the binary gibberish we started out with.
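As one last sanity check, the number of JSON documents coming out of the data subcommand should line up with the Rows: 4 value we saw in the meta output, and a quick grep for lines starting with a curly brace confirms it does.
$ java -jar orc-tools-1.5.1-uber.jar data orcfile | grep -c '^{'
4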
This was originally posted at https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/12/12/1397686273/viewing+the+content+of+ORC+files+using+the+Java+ORC+tool+jar