Thursday 12 June 2014

Automatic (Dynamic) Partitioning With Hive


Partitioning in Hive is usually considered a tedious task when loading data. You have to segregate the data manually and tell Hive which partition you want to load it into. This requires a lot of manual intervention and effort.

Furthermore, partitioning is usually done on columns that do not exist in the data itself. But what if we want to partition on a column that does exist in our data? I used to be clueless in this scenario.

Another problem is querying one large HDFS file that contains historical data. If we can somehow divide the large file into smaller pieces, we get a faster turnaround time when querying.

Consider a situation where we have one file with two fields: ORDERID and ORDERDATE. Let's say we receive a large number of orders each day. We want to view that file in Hive, but creating one large table and querying it is a heavy operation. A better way is to expose the file as a Hive table partitioned by date, with all the partitions filled automatically based on the values in the input file.
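For instance, the input file might contain comma-delimited rows like these (the values are purely illustrative):

1001,2014-06-10
1002,2014-06-10
1003,2014-06-11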

This problem can be solved in four steps (a HiveQL sketch follows the list):

1) Set a couple of properties in Hive to enable dynamic partitioning.

2) Create an external staging table "staging_order" and load the input file's data into it.

3) Create the main production external table "production_order" with the date as one of the partition columns.

4) Load the production table from the staging table so that the data is distributed across partitions automatically.
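Here is a minimal HiveQL sketch of these four steps. The HDFS locations, the comma delimiter, and the column types are assumptions for illustration; adjust them to match your data.

-- Step 1: enable dynamic partitioning; "nonstrict" mode lets every
-- partition be determined dynamically from the data
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 2: external staging table pointing at the raw input files
CREATE EXTERNAL TABLE staging_order (
  orderid   INT,
  orderdate STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging_orders';

-- Step 3: production table partitioned by date; note that the
-- partition column goes in PARTITIONED BY, not in the column list
CREATE EXTERNAL TABLE production_order (
  orderid INT
)
PARTITIONED BY (orderdate STRING)
LOCATION '/data/production_orders';

-- Step 4: load from staging; Hive routes each row to the partition
-- matching its ORDERDATE value (partition columns go last in the SELECT)
INSERT OVERWRITE TABLE production_order PARTITION (orderdate)
SELECT orderid, orderdate
FROM staging_order;

After the INSERT, each distinct ORDERDATE value becomes its own subdirectory under the table's location (e.g. orderdate=2014-06-10), and queries that filter on ORDERDATE read only the matching partitions.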



Comments:

  1. Please feel free to highlight mistakes and suggest improvements in the blog; it will help us make it better. We also appreciate your questions and will be happy to answer them.

  2. This is fine for internal tables, but it is not how we deal with external tables. For an external table we don't load the files with a Hive query: we set the folder location while creating the table and then dump data into that folder, so there is no need to run a LOAD command. LOAD is what we use for internal tables, where the partition information is updated in the metastore whenever the command runs.

    For an external table we place the data directly from the source, so Hive doesn't know what partitioned data we dumped and won't update the metadata in the metastore.

    I read a few blogs - we need to write a script for this to update the metadata; check the links below.

    Until then we need to add partitions manually using ALTER TABLE ... ADD PARTITION.

    Apache says we can use MSCK REPAIR TABLE, but it is not working for me.
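    For example (reusing the post's production_order table; the date and path here are illustrative):

    -- register one partition explicitly
    ALTER TABLE production_order ADD IF NOT EXISTS
      PARTITION (orderdate='2014-06-10')
      LOCATION '/data/production_orders/orderdate=2014-06-10';

    -- or ask Hive to scan the table location and register
    -- any partition directories it finds
    MSCK REPAIR TABLE production_order;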
