how do i load a fixed-width formatted file into hive? (with a little help from pig)

FEB 5, 2015 UPDATE: See the comments section of the original Confluence posting (from which this was ported), which identifies a better way to do this using FixedWidthLoader.

I was talking with a client earlier this week who is using the Hortonworks Sandbox to jumpstart his Hadoop learning (hey, that’s exactly why we put it out there).

Thanks to all the Sandbox tutorials out there he already knew how to take a delimited file, get it into HDFS, use HCatalog to build a table, and then run Hive queries against it.  What he didn’t know how to do was to load up a good old-fashioned fixed-width formatted file into a table.  My first hunch was to use a simple Pig script to convert it into a delimited file and I told him I’d pull together a simple set of instructions to help him with his Sandbox self-training.  Robert, here ya go!!

Create & Upload a Test File

To try this all out we first need a little test file.  Create a file called emps-fixed.txt on your workstation with the following contents (yes… the first two lines are part of the exploration, but they don’t break anything either).

12345678901234567890123456789012345678901234567890123456789012345678901234567890
EMP-ID    FIRST-NAME          LASTNAME            JOB-TITLE           MGR-EMP-ID
12301     Johnny              Begood              Programmer          12306     
12302     Ainta               Listening           Programmer          12306     
12303     Neva                Mind                Architect           12306     
12304     Joseph              Blow                Tester              12308     
12305     Sallie              Mae                 Programmer          12306     
12306     Bilbo               Baggins             Development Manager 12307     
12307     Nuther              One                 Director            11111     
12308     Yeta                Notherone           Testing Manager     12307     
12309     Evenmore            Dumbnames           Senior Architect    12307     
12310     Last                Sillyname           Senior Tester       12308     

Then upload the file into HDFS and just land it in your home directory on the Sandbox.

I’m taking some creative liberty and assuming that you’ve worked through (or at least understand the concepts of) some/most/all of the Sandbox tutorials.  For example, check out Tutorial 1 to see how to upload this file.  If at any point in this blog posting you are unfamiliar with how to do an activity, check back with the tutorials for help.  If you can’t find which one can help you, let me know in the comments section and I’ll reply with more info.

At this point, you should have a /user/hue/emps-fixed.txt file sitting in HDFS.

Build a Converter Script

Now we just need to build a simple converter script.  There are plenty of good resources out there on Pig including the tutorials; I’m even trying to build up a Pig Cheat Sheet.  For the one you’ll see below, I basically stripped down the info found at https://bluewatersql.wordpress.com/2013/04/17/shakin-bacon-using-pig-to-process-data/ (remember, a lazy programmer is a good programmer & imitation is the sincerest form of flattery) which is worth checking out.

Using the Sandbox, let’s create a Pig script called convert-emp which will start out with some plumbing.  Yes, that base “library” is called the piggy bank – I love the name, too!!

REGISTER piggybank.jar;
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

fixed_length_input = load '/user/hue/emps-fixed.txt' as (row:chararray);

Then build a structure using the SUBSTRING function to pluck the right values out of each line.  Lastly, just write this new structure into a file on HDFS.

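-- SUBSTRING takes a zero-based start index and an exclusive stop index,
-- so each stop value is the start of the next column (widths 10/20/20/20/10).
employees = foreach fixed_length_input generate
    (int)TRIM(SUBSTRING(row, 0,  10)) AS emp_id,
         TRIM(SUBSTRING(row, 10, 30)) AS first_name,
         TRIM(SUBSTRING(row, 30, 50)) AS last_name,
         TRIM(SUBSTRING(row, 50, 70)) AS job_title,
    (int)TRIM(SUBSTRING(row, 70, 80)) AS mgr_emp_id;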
    
store employees into '/user/hue/emps-delimited';
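
If you want to reason about the column offsets away from the Sandbox, the same slicing can be mimicked in a few lines of Python.  This is only a sketch, not part of the Pig script: the COLUMNS layout is my reading of the 10/20/20/20/10-character columns in the ruler line, the function names are made up for illustration, and returning None for a value that is not an integer is meant to resemble the null that Pig produces when a cast fails.

```python
# Assumed layout: (name, start, stop) with an exclusive stop index,
# read off the 10/20/20/20/10-character columns in the ruler line.
COLUMNS = [
    ("emp_id", 0, 10),
    ("first_name", 10, 30),
    ("last_name", 30, 50),
    ("job_title", 50, 70),
    ("mgr_emp_id", 70, 80),
]
INT_COLUMNS = {"emp_id", "mgr_emp_id"}


def to_int(text):
    """Return text as an int, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_row(row):
    """Slice one fixed-width row into a dict of trimmed column values."""
    record = {}
    for name, start, stop in COLUMNS:
        value = row[start:stop].strip()
        record[name] = to_int(value) if name in INT_COLUMNS else value
    return record


def to_delimited(row, sep="\t"):
    """Render one fixed-width row as a delimited line; None becomes an empty field."""
    values = parse_row(row).values()
    return sep.join("" if v is None else str(v) for v in values)


if __name__ == "__main__":
    # Build an 80-character sample row matching the test file's layout.
    sample = ("12306".ljust(10) + "Bilbo".ljust(20) + "Baggins".ljust(20)
              + "Development Manager".ljust(20) + "12307".ljust(10))
    print(to_delimited(sample))
```

Running it prints the Bilbo Baggins row as five tab-separated values, which is the shape you should expect to find in the “part” file.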

Verify the Output File

Now you can navigate into /user/hue/emps-delimited on HDFS and you’ll see the familiar (or not so familiar?) “part” file via the Sandbox’s File Browser which shows the following tab-delimited content.

You’re welcome to rerun the exercise without the first two rows, but again, they won’t cause any real harm due to the forgiving nature of Pig, which tries its best to do what you tell it to do.  For the “real world”, you’d probably want to write some validation logic to make sure that a value like “EMP-ID” (from the source file) fails loudly when it is cast to an integer instead of slipping through quietly.
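
As a rough idea of what that validation could look like, here is a small Python pre-check that could be run against the file before the Pig conversion.  Treat it as a sketch under assumptions: the LAYOUT table, the find_problems and split_rows names, and the rule that text columns should not contain digits are all mine, chosen to fit this sample file.  Note that the ruler line is all digits, so a numeric check on the ID columns alone would not catch it.

```python
# Hypothetical layout for emps-fixed.txt: (name, start, stop, kind), stop exclusive.
LAYOUT = [
    ("emp_id", 0, 10, "id"),
    ("first_name", 10, 30, "text"),
    ("last_name", 30, 50, "text"),
    ("job_title", 50, 70, "text"),
    ("mgr_emp_id", 70, 80, "id"),
]


def find_problems(row):
    """Return a list of problems found in one row; an empty list means it looks valid."""
    problems = []
    for name, start, stop, kind in LAYOUT:
        value = row[start:stop].strip()
        if not value:
            problems.append(f"{name} is empty")
        elif kind == "id" and not value.isdigit():
            problems.append(f"{name} is not numeric: {value!r}")
        elif kind == "text" and any(ch.isdigit() for ch in value):
            problems.append(f"{name} contains digits: {value!r}")
    return problems


def split_rows(rows):
    """Separate rows into (good, bad) lists based on find_problems."""
    good, bad = [], []
    for row in rows:
        (bad if find_problems(row) else good).append(row)
    return good, bad
```

With the test file above, both the ruler line and the header line land in the bad list, and the ten employee rows land in the good list.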

Load a Table and Run a Query

Now you need to create a table based on this delimited file (hint, see Tutorial 1 again for one way to do that) and then run a simple query to verify all works.  I ran the following HiveQL.

select emp_id, last_name, job_title
  from employees
 where mgr_emp_id = 12306;

It returned the following in the Sandbox.

There you have it – a super simple way to get a fixed-width file into a table that you can query from Hive.

visiting the computer history museum (yes, i’m a geek)

In the effort to move to WordPress in 2020, I could not find a suitable conversion program to move my Confluence-hosted technical blog for all occasions. For this blog post, I’m including the following links.

Original Confluence blog post as well as a PDF copy of it.

fruITion and recrEAtion (a double-header book review)

Several years ago, Chris Potts wrote his two-part Information Technology and Enterprise Architecture story (or was it a warning?) in FruITion and RecrEAtion.  These stories are intended to express Chris’ belief that IT/EA teams should be fully engaged in the Strategy of the companies they belong to, and not just be a “cost center”.  He uses a novel format in both books and follows a character from the first novel into the second.

I have to admit I read them in the wrong order.  That said, I’m glad I did as I’m not sure I would have picked up recrEAtion if I finished fruITion first.  That’s not to say that it is dramatically different than the other one, but I really identified with recrEAtion in the first place because of my involvement in the EA function at my employer.  I wanted to get some additional viewpoints on EA which is often the most misunderstood (or least organized?) team within a technology shop.  

The first book, fruITion, is centered on a CIO whose CEO went ballistic when she was presented a 78-page IT Strategy document full of business process maps, architecture blueprints and technology roadmaps.  The CEO had already decided to dramatically shake up the IT organization and the rest of the book is about how the CIO decided to respond to an approach that seemed totally against everything he had done, and known, his entire career.  I’ll let you read it to find out how he did.

The second book, recrEAtion, focuses on a character who left the company in the first book.  His role at a new company was to lead the Enterprise Architecture function.  He didn’t buy into the CEO’s thoughts at his prior company and was thrown another curve ball when his new CEO said, “sounds like you and I have the same job.”  Needless to say, having two strange experiences back to back really caused the main character to question what was going on around him.

The underlying theme in both of these novels is IT/EA’s involvement in the “strategy” of the company – not just having an IT strategy that is centered on what is in place today and IT’s understanding of what needs to be done next.  Both throw in some rather drastic changes to how IT/EA shops should be structured and aligned.  They go past the well-known “partner with the business” mantra and suggest that the reader “be the business”.  In both stories, there is a strong push for measuring success by metrics – almost always financial.

While I’m not so sure many organizations are ready for the sweeping changes presented in these two stories, the push for technology teams to not be isolated, but very tightly integrated, is a good topic for discussion in almost any organization.  

I can recommend these books to senior technologists and formal technology leaders in your organizations, but would not recommend them to less seasoned and/or more narrowly focused employees, or even folks in “the business”, because they oversimplify (for brevity’s sake, not necessarily because the author does not understand it) the complexity of a technology team of any size.  I’d hate for someone to get the impression that all IT/EA needs to do is talk “business strategy” and not be responsible for the normal “care and feeding” activities that are the responsibility of the non-innovation aspects of an IT organization.

As for me personally, the biggest thing I take away from these books (especially recrEAtion) is that we should stop imagining that we know where we need to be in five years (i.e. the “end-state architecture”) and stop focusing so hard on maintaining an accurate roadmap of how to get there.  The author doesn’t suggest we should stop planning, but what I took from it was that we need to be constantly preparing multiple options of what we might do next and focus on the financial reasons our companies might want, or need, to go there.  If these options are truly aligned with our business needs then we will be more respected at the decision-making table and have a much better chance of not only being successful, but working on the “right” things for our enterprises.

are you a mort, elvis or einstein (or are these labels nonsense)?

I was recently discussing the old Microsoft personas of Mort, Elvis and Einstein with some folks at work and was shocked that most had never heard of them.  You can easily google these three magical names and find several articles such as this one, here, and yet another one (well… if you can read German).  Here they are in a nutshell.

  • Mort, the opportunist developer, likes to create quick-working solutions for immediate problems.  He focuses on productivity and learns as needed.
  • Elvis, the pragmatic programmer, likes to create long-lasting solutions addressing the problem domain, and learning while working on the solution.
  • Einstein, the paranoid programmer, likes to create the most efficient solution to a given problem, and typically learns in advance before working on the solution.

Well, which best describes you?  Can you use these labels to describe folks that you work with?  Or… are these just junk?

enterprise 2.0 book review (using web 2.0 technologies within organizations)

My prior employer had a company-wide MediaWiki instance along with other social/collaborative tools (blogs, forums, etc) in addition to the expected deployment of SharePoint.  To help span the various notifications, an internally-developed aggregator was created to roll up the various activity feeds that are produced into a social-oriented view.  At my current organization, this level of open authoring tooling is not as prevalent.

Alongside a rollout of JIRA/Greenhopper, I introduced Confluence to help development projects with their documentation.  Since then, we’ve seen an explosion of grass-roots content creation and collaboration.  Looking for information on this phenomenon led me to read Andrew McAfee’s Enterprise 2.0, which addresses this subject matter well.  What’s “Enterprise 2.0”?  Well, check out this three-minute video from the author.

If that sparked your interest, or you just have trouble falling asleep at night, check out my review of the book.

You can also grab the slides.

I’m continuing my push to enable & encourage further collaboration and I’d love to hear about your experiences of how collaborative tools such as wikis and blogs have made a difference in your organizations.

give as few orders as possible (encourage autonomy and responsibility)

I came across a great quote the other day: “Give as few orders as possible. Once you’ve given orders on a subject, you must always give orders on that subject.”  Brownie points to anyone who can source that quote (hint: it is from a SciFi book).

I’m not talking about Saint Philip Neri’s (paraphrased) quote of “he who wishes to be perfectly obeyed should give few orders” which itself was directed at government and its influence on citizens.  I’m talking about encouraging autonomy and responsibility by not micro-managing your team.

Working by this approach simply means telling folks what you want done and getting out of their way so they can find a workable solution by themselves.  Many of us might think we follow this approach, but it falls apart in a hurry when someone comes back with a solution that isn’t exactly what we were thinking the implementation would be.  It takes a solid leader to hold their tongue and simply ask questions that validate that the solution presented actually solves the problem.  If it meets the requirements and was completed within whatever constraints have been established (company standards, team’s best practices, etc) then it’s a job well done.

If it doesn’t meet the requirements, I highly encourage you to coach your team member in such a way that they are still in control of much of their destiny and how they choose to solve the problem.  Every time we tell a person how to solve a problem, we’re telling them to stop thinking about problems like this, because the official (aka “boss approved”) way to handle this particular case has been set.

I’m not just talking about what tools, patterns, frameworks, etc to use; I’m talking about giving a coarse-grained problem to someone and letting them decompose the work effort into the needed tasks.  If you have to break down the problem into discrete tasks then that person will start looking for you to break it down that way for them in the future.  They’ll do this because they now feel like you want to micro-manage them in this way and they will now wait for you to give them direction on similar work in the future.  Your team is not going to scale like that.

This doesn’t mean we give our teams complete autonomy.  Again, there are constraints that make sense in your environment and this varies, but on the things that allow flexibility — give that flexibility back to your team members.  When you eliminate the flexibility by making choices for them, then the team will stop searching for a better way and just use your way going forward.  Worse yet, folks will start waiting for you to do the work you really want them to tackle and get better at.

Your team is bright and they want to help you. Give them an assignment and encourage them to tackle even tougher assignments by offering them as much flexibility as you can give and being careful to coach, not direct, them through the tough spots.