viewing the content of ORC files (using the Java ORC tool jar)

The Apache ORC file format has been used heavily by Apache Hive for many years now, but since it is a binary file format, basic tools like cat just can't show us anything meaningful about the contents of these files, as shown below.

$ cat orcfile 
ORC
P1

       ???>P>??be!Q%~.ע?d!?????T	?;


DoeSmith(P4??be!%..&wG!??       ?
                                 ??'LesterEricJohnSusie	FdEBR	F6PDoeMartinSmithGATXOKMA???
??]?M?Ku??????9?sT?#?ްͲ㖆O:^xh?>??FWe?Pve??桿F?Ӳ?LuS????b?`	`??`???/p?_?]C?8???kQf?kpiqf??PB?K
                  (???쒟
X?X8X?9X?89.?   Ź?????B"$?b4?`X?$???,??(???????#?????"Ŝ??"Ś????*Ś??KKR??8????
             ????b??a%???????Z??,?\Z??*????J?q1???s3K2$4??rVB@q..&wG!?? ???????"
                                                                               (^0??ORC
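About the only recognizable thing in that mess is the ORC magic string at the head (and tail) of the file. If you ever just need to tell whether a mystery file is ORC at all, checking that leading 3-byte magic is enough for a first pass. A quick Python sketch (my own throwaway helper, not part of the ORC project):

```python
def looks_like_orc(path):
    """Return True if the file begins with ORC's 3-byte magic string."""
    with open(path, "rb") as f:
        return f.read(3) == b"ORC"
```

A proper check would also parse the postscript at the tail of the file, but the leading magic is plenty to tell an ORC file apart from, say, a CSV.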

Fortunately, the ORC project has a couple of options for CLI tools. For this posting, I settled on the Java Tools. Now, you could be a good citizen and build these yourself from source, but I (the lazy programmer that I am) decided to just download a compiled “uber jar” file.

First, I needed to figure out which version of ORC I was using. I am currently running HDP 3.1.0, so I took a peek into the Hive lib folder.

$ ls /usr/hdp/current/hive-client/lib/orc*
/usr/hdp/current/hive-client/lib/orc-core-1.5.1.3.1.0.0-78.jar
/usr/hdp/current/hive-client/lib/orc-shims-1.5.1.3.1.0.0-78.jar

The HDP jar file naming convention let me know I was using ORC 1.5.1, so I surfed over to https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/ and then pulled down the appropriate file.

wget https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.1/orc-tools-1.5.1-uber.jar
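Strictly an aside, but since HDP tacks its own build numbers onto the end of the upstream version (1.5.1 becomes 1.5.1.3.1.0.0-78), pulling the upstream version back out of a jar name is a one-liner. A quick sketch of that parsing (a hypothetical helper of mine, not any official tooling):

```python
import re

def upstream_orc_version(jar_name):
    """Extract the upstream ORC version (first three components) from an
    HDP-suffixed jar name, e.g. orc-core-1.5.1.3.1.0.0-78.jar -> 1.5.1."""
    m = re.search(r"orc-(?:core|shims|tools)-(\d+)\.(\d+)\.(\d+)", jar_name)
    return ".".join(m.groups()) if m else None
```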

Now, I'm ready to use the tools, but… I realized I didn't have an ORC file to test them out with, so I decided I would use Apache Pig to build a small one. I first created a simple CSV file with vi and then pushed it to HDFS. The contents of the file are as follows.

$ hdfs dfs -cat pig/customers.csv
1001,Lester,Martin,GA
1002,Eric,Martin,TX
1003,John,Doe,OK
1004,Susie,Smith,MA

I then wrote a little read & write conversion script and executed it.

$ cat createORC.pig 
custs = LOAD 'pig/customers.csv' USING PigStorage(',')
          AS (cid:int, fname:chararray, lname:chararray, state:chararray);
STORE custs INTO 'orcfile' USING OrcStorage;
$ pig createORC.pig
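As an aside, the AS clause is what applies types to the raw comma-separated text. A rough Python illustration of that same schema mapping (illustration only; Pig and OrcStorage obviously do far more than this under the hood):

```python
import csv
from io import StringIO

def apply_customer_schema(csv_text):
    """Mimic the Pig schema (cid:int, fname, lname, state) on raw CSV text."""
    return [
        {"cid": int(cid), "fname": fname, "lname": lname, "state": state}
        for cid, fname, lname, state in csv.reader(StringIO(csv_text))
    ]
```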

As expected, it created a simple little ORC file, which I pulled down to my Linux home directory.

$ hdfs dfs -ls orcfile
Found 2 items
-rw-r--r--   3 zeppelin hdfs          0 2019-12-12 08:35 orcfile/_SUCCESS
-rw-r--r--   3 zeppelin hdfs        569 2019-12-12 08:35 orcfile/part-v000-o000-r-00000
$ hdfs dfs -get orcfile/part-v000-o000-r-00000 orcfile
$ ls -l orcf*
-rw-r--r--. 1 zeppelin hadoop 569 Dec 12 08:36 orcfile

NOW, we can finally try out the ORC Tools jar. First up, we can look at the metadata of this file.

$ java -jar orc-tools-1.5.1-uber.jar meta orcfile
Processing data file orcfile [length: 569]
Structure for orcfile
File Version: 0.12 with ORC_135
Rows: 4
Compression: ZLIB
Compression size: 262144
Type: struct<cid:int,fname:string,lname:string,state:string>

**** REMOVED CONTENT FROM THESE SECTIONS FOR BREVITY ****
Stripe Statistics:
File Statistics:
Stripes:

File length: 569 bytes
Padding length: 0 bytes
Padding ratio: 0%

That had some interesting info (and I definitely trimmed a bunch above to keep this from being too verbose), but what this post is really about is seeing the contents of the file, so we just switch the subcommand from meta to data.

$ java -jar orc-tools-1.5.1-uber.jar data orcfile
Processing data file orcfile [length: 569]
{"cid":1001,"fname":"Lester","lname":"Martin","state":"GA"}
{"cid":1002,"fname":"Eric","lname":"Martin","state":"TX"}
{"cid":1003,"fname":"John","lname":"Doe","state":"OK"}
{"cid":1004,"fname":"Susie","lname":"Smith","state":"MA"}

Perfect, we can see the four rows represented as JSON documents, which is so much easier to read than that binary mess we started out with.
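Since the data subcommand emits one JSON document per row, its output is easy to post-process. For example, you could pipe it into a few lines of Python (a hypothetical downstream step of mine, not part of the ORC tools) to skip the status lines and work with the rows directly:

```python
import json

def rows_from_json_lines(lines):
    """Parse line-delimited JSON like the output of `orc-tools data`,
    skipping status lines such as 'Processing data file ...'."""
    return [json.loads(ln) for ln in lines if ln.lstrip().startswith("{")]
```

Something like `java -jar orc-tools-1.5.1-uber.jar data orcfile | python myscript.py` (where myscript.py is your own wrapper) could then feed sys.stdin straight into this function.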

Published by lestermartin

Developer advocate, trainer, blogger, and data engineer focused on data lake & streaming frameworks including Trino, Hive, Spark, Flink, Kafka and NiFi.
