Hive Tutorial - Part 2

Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • You want to use a custom location such as ASV.
  • Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
  • You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
  • The data is temporary.
  • You want Hive to completely manage the lifecycle of the table and data. 
Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon.txt /user/demo/food/
 

 root@mahesh:/data# hadoop fs -ls /user/demo/food
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 root supergroup        124 2013-05-05 18:27 /user/demo/food/bacon.txt


Enter the Hive
root@mahesh:/data# hive

Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

hive> create table internal1(col1 string) LOCATION '/user/demo/food';
OK
Time taken: 0.996 seconds
 

That will return the time taken but no other result. Now let's look at the schema that was created:. Note that the table type is MANAGED_TABLE.
 
hive> describe formatted internal1;
OK
col_name    data_type    comment
# col_name                data_type               comment           
         
col1                    string                  None              
         
# Detailed Table Information         
Database:               employees               
Owner:                  root                    
CreateTime:             Sun May 05 18:18:53 IST 2013    
LastAccessTime:         UNKNOWN                 
Protect Mode:           None                    
Retention:              0                       
Location:               hdfs://localhost:54310/user/demo/food    
Table Type:             MANAGED_TABLE           
Table Parameters:         
    transient_lastDdlTime    1367758133        
         
# Storage Information         
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe    
InputFormat:            org.apache.hadoop.mapred.TextInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat    
Compressed:             No                      
Num Buckets:            -1                      
Bucket Columns:         []                      
Sort Columns:           []                      
Storage Desc Params:         
    serialization.format    1                 
Time taken: 0.724 seconds
 

 hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
Time taken: 1.25 seconds



What happens if we don't specify a directory for an INTERNAL table?
hive> CREATE TABLE internaldefault(col1 string);
It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):
dfs -lsr /user/hive/warehouse;

We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory: drwxr-xr-x   - root supergroup          0 2013-05-05 18:37 /user/hive/warehouse/employees.db/internaldefault

However, we won't use this table for the rest of the demo so let's drop it to avoid confusion. The drop also removes the subdirectory.
DROP TABLE internaldefault;
     dfs -lsr /user/hive/warehouse;

 Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let's add a 2nd file to the first internal table and check that it exists:

root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon2.txt /user/demo/food/

hive> dfs -ls /user/demo/food;                 
Found 2 items
-rw-r--r--   1 root supergroup        124 2013-05-05 18:27 /user/demo/food/bacon.txt
-rw-r--r--   1 root supergroup         31 2013-05-05 18:47 /user/demo/food/bacon2.txt


Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output - there is no row value of col1 in the data as headers are not generally included in Hadoop data): 

hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
Time taken: 0.457 seconds


Now let's create an EXTERNAL table that points to the same directory and look at the schema:
hive> CREATE EXTERNAL TABLE extenal1(col1 STRING) LOCATION '/user/demo/food';
OK
Time taken: 0.306 seconds
 
hive> DESCRIBE FORMATTED external1;                                          
OK
# col_name                data_type               comment            
          
col1                    string                  None               
          
# Detailed Table Information          
Database:               employees                
Owner:                  root                     
CreateTime:             Sun May 05 19:13:36 IST 2013     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://localhost:54310/user/demo/food     
Table Type:             EXTERNAL_TABLE           
Table Parameters:          
    EXTERNAL                TRUE               
    transient_lastDdlTime    1367761416         
          
# Storage Information          
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe     
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:          
    serialization.format    1                  
Time taken: 0.163 seconds

 


This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS

View the Table data,
 
hive> SELECT * FROM external1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn

 


You may create multiple tables for the same data set if you are experimenting with various structures/schemas.
Add another data file to the same directory and see how it's visible to all the tables that point to that directory:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put veggies.txt /user/demo/food/

SELECT * FROM internal1;
SELECT * FROM external1;
SELECT * FROM external2;
Each table will return the same results:
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456


Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:
DROP TABLE internal1;
SELECT * FROM external1;

h\ive> dfs -lsr /user/demo/food;
lsr: Cannot access /user/demo/food: No such file or directory.

Because the INTERNAL (managed) table is under Hive's control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!
Clean up the demo tables and directory:
DROP TABLE external1;

exit;
This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.
 
 SAMPLE FILES
Create new text file named bacon.txt , add the following content
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
 Create another new text file name bacon2.txt , add the following content
More_BaCoN
AndEvenMore_bAcOn

Veggies.txt
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456
 

Previous Post Next Post