Hive Tutorial - Part 2
Hive tables can be created as EXTERNAL or INTERNAL. This choice affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
- Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
- You want to use a custom storage location such as ASV (Azure Blob Storage on HDInsight).
- Hive should not own the data or control its settings, directories, etc.; another program or process will handle those things.
- You are not creating the table based on an existing table (CREATE TABLE ... AS SELECT).
Use INTERNAL tables when:
- The data is temporary.
- You want Hive to completely manage the lifecycle of the table and data.
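In DDL terms, the only syntactic difference is the EXTERNAL keyword (and, usually, an explicit LOCATION). A minimal sketch, using hypothetical table and path names:

```sql
-- INTERNAL (managed): Hive owns the data; DROP TABLE deletes the files.
CREATE TABLE managed_demo (col1 STRING);

-- EXTERNAL: Hive only tracks the schema; DROP TABLE leaves the files in place.
CREATE EXTERNAL TABLE external_demo (col1 STRING)
LOCATION '/user/demo/some_dir';
```

The rest of this tutorial demonstrates these differences with real files.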
First, copy the sample bacon.txt file (see SAMPLE FILES at the end of this post) into HDFS:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon.txt /user/demo/food/
root@mahesh:/data# hadoop fs -ls /user/demo/food
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r-- 1 root supergroup 124 2013-05-05 18:27 /user/demo/food/bacon.txt
Enter the Hive shell:
root@mahesh:/data# hive
Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:
hive> create table internal1(col1 string) LOCATION '/user/demo/food';
OK
Time taken: 0.996 seconds
That will return the time taken but no other result. Now let's look at the schema that was created. Note that the table type is MANAGED_TABLE.
hive> describe formatted internal1;
OK
# col_name data_type comment
col1 string None
# Detailed Table Information
Database: employees
Owner: root
CreateTime: Sun May 05 18:18:53 IST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:54310/user/demo/food
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1367758133
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.724 seconds
hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
Time taken: 1.25 seconds
What happens if we don't specify a directory for an INTERNAL table?
hive> CREATE TABLE internaldefault(col1 string);
It is created in the default Hive warehouse directory, which by default is /user/hive/warehouse (the dfs command inside the Hive shell is just hadoop fs):
hive> dfs -lsr /user/hive/warehouse;
We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table, Hive would put it in this directory:
drwxr-xr-x - root supergroup 0 2013-05-05 18:37 /user/hive/warehouse/employees.db/internaldefault
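For reference, loading a local file into such a table would look like the sketch below (the file path is hypothetical; we won't actually run this here):

```sql
-- Hive copies the file into the table's warehouse subdirectory,
-- e.g. /user/hive/warehouse/employees.db/internaldefault
LOAD DATA LOCAL INPATH '/tmp/somefile.txt' INTO TABLE internaldefault;
```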
However, we won't use this table for the rest of the demo, so let's drop it to avoid confusion. The drop also removes the subdirectory:
hive> DROP TABLE internaldefault;
hive> dfs -lsr /user/hive/warehouse;
Once we dropped the internaldefault table, the directory that Hive created was automatically cleaned up. Now let's add a second file, bacon2.txt, to the first internal table's directory and check that it exists:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon2.txt /user/demo/food/
hive> dfs -ls /user/demo/food;
Found 2 items
-rw-r--r-- 1 root supergroup 124 2013-05-05 18:27 /user/demo/food/bacon.txt
-rw-r--r-- 1 root supergroup 31 2013-05-05 18:47 /user/demo/food/bacon2.txt
hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
Time taken: 0.457 seconds
Now let's create an EXTERNAL table that points to the same directory and look at the schema:
hive> CREATE EXTERNAL TABLE external1(col1 STRING) LOCATION '/user/demo/food';
OK
Time taken: 0.306 seconds
hive> DESCRIBE FORMATTED external1;
OK
# col_name data_type comment
col1 string None
# Detailed Table Information
Database: employees
Owner: root
CreateTime: Sun May 05 19:13:36 IST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:54310/user/demo/food
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1367761416
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.163 seconds
This time the table type is EXTERNAL_TABLE. You can also see that the location was expanded to include the default settings, which in this case point to the localhost machine using the default HDFS port. View the table data:
hive> select * from external1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
You may create multiple tables for the same data set if you are experimenting with various structures/schemas.
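For instance, the external2 table queried below can be created over the very same directory (same data, second schema):

```sql
CREATE EXTERNAL TABLE external2 (col1 STRING)
LOCATION '/user/demo/food';
```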
Add another data file to the same directory and see how it's visible to all the tables that point to that directory:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put veggies.txt /user/demo/food/
SELECT * FROM internal1;
SELECT * FROM external1;
SELECT * FROM external2;
Each table will return the same results:
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456
Now drop the INTERNAL table and then look at the data from the EXTERNAL tables, which now return no rows:
DROP TABLE internal1;
SELECT * FROM external1;
hive> dfs -lsr /user/demo/food;
lsr: Cannot access /user/demo/food: No such file or directory.
Because the INTERNAL (managed) table is under Hive's control, dropping it removed the underlying data. The other tables that point to that same data still exist, but they now return no rows!
Clean up the demo tables and directory:
DROP TABLE external1;
exit;
This should give you an introductory-level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.
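As a side note, if you discover that a table was accidentally created as managed, Hive lets you flip it to external (and back) without touching the data. A sketch, using the internal1 table from this demo:

```sql
-- Mark an existing table as external so DROP TABLE no longer deletes its files.
ALTER TABLE internal1 SET TBLPROPERTIES ('EXTERNAL'='TRUE');
```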
SAMPLE FILES
Create a new text file named bacon.txt and add the following content:
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
Create another new text file named bacon2.txt and add the following content:
More_BaCoN
AndEvenMore_bAcOn
Create a third text file named veggies.txt and add the following content:
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456