Hive Tutorial - Part 2
Hive tables can be created as EXTERNAL or INTERNAL. This choice affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
- Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
- You want to use a custom storage location such as ASV (Azure Blob Storage on HDInsight).
- Hive should not own the data or control its settings, directories, etc.; another program or process will handle those things.
- You are not creating the table based on an existing table (CREATE TABLE ... AS SELECT).
Use INTERNAL tables when:
- The data is temporary.
- You want Hive to completely manage the lifecycle of the table and data.
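In DDL terms, the only syntactic difference is the EXTERNAL keyword (and, usually, an explicit LOCATION). A minimal sketch, using hypothetical table and path names:

```sql
-- INTERNAL (managed): Hive owns the data; DROP TABLE deletes the files.
CREATE TABLE managed_demo (col1 STRING);

-- EXTERNAL: Hive only tracks the schema; DROP TABLE leaves the files in place.
CREATE EXTERNAL TABLE external_demo (col1 STRING)
LOCATION '/user/demo/some_dir';
```

The rest of this tutorial demonstrates these differences with real files.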
First, copy the sample bacon.txt file (see SAMPLE FILES at the end of this post) into HDFS:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon.txt /user/demo/food/
root@mahesh:/data# hadoop fs -ls /user/demo/food
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r-- 1 root supergroup 124 2013-05-05 18:27 /user/demo/food/bacon.txt
Enter the Hive shell:
root@mahesh:/data# hive
Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:
hive> create table internal1(col1 string) LOCATION '/user/demo/food';
OK
Time taken: 0.996 seconds
That will return the time taken but no other result. Now let's look at the schema that was created. Note that the table type is MANAGED_TABLE.
hive> describe formatted internal1;
OK
# col_name data_type comment
col1 string None
# Detailed Table Information
Database: employees
Owner: root
CreateTime: Sun May 05 18:18:53 IST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:54310/user/demo/food
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1367758133
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.724 seconds
hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
Time taken: 1.25 seconds
What happens if we don't specify a directory for an INTERNAL table?
hive> CREATE TABLE internaldefault(col1 string);
It is created in the default Hive warehouse directory, which by default is /user/hive/warehouse (the dfs command inside the Hive shell is just hadoop fs):
hive> dfs -lsr /user/hive/warehouse;
We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table, Hive would put it in this directory:
drwxr-xr-x - root supergroup 0 2013-05-05 18:37 /user/hive/warehouse/employees.db/internaldefault
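For reference, loading a local file into such a table would look like the sketch below (the file path is hypothetical; we won't actually run this here):

```sql
-- Hive copies the file into the table's warehouse subdirectory,
-- e.g. /user/hive/warehouse/employees.db/internaldefault
LOAD DATA LOCAL INPATH '/tmp/somefile.txt' INTO TABLE internaldefault;
```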
However, we won't use this table for the rest of the demo, so let's drop it to avoid confusion. The drop also removes the subdirectory:
hive> DROP TABLE internaldefault;
hive> dfs -lsr /user/hive/warehouse;
Once we dropped the internaldefault table, the directory that Hive created was automatically cleaned up. Now let's add a second file, bacon2.txt, to the first internal table's directory and check that it exists:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put bacon2.txt /user/demo/food/
hive> dfs -ls /user/demo/food;
Found 2 items
-rw-r--r-- 1 root supergroup 124 2013-05-05 18:27 /user/demo/food/bacon.txt
-rw-r--r-- 1 root supergroup 31 2013-05-05 18:47 /user/demo/food/bacon2.txt
hive> select * from internal1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
Time taken: 0.457 seconds
Now let's create an EXTERNAL table that points to the same directory and look at the schema:
hive> CREATE EXTERNAL TABLE external1(col1 STRING) LOCATION '/user/demo/food';
OK
Time taken: 0.306 seconds
hive> DESCRIBE FORMATTED external1;
OK
# col_name data_type comment
col1 string None
# Detailed Table Information
Database: employees
Owner: root
CreateTime: Sun May 05 19:13:36 IST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:54310/user/demo/food
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1367761416
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.163 seconds
This time the table type is EXTERNAL_TABLE. You can also see that the location was expanded to include the default settings, which in this case point to the localhost machine using the default HDFS port. View the table data:
hive> select * from external1;
OK
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
You may create multiple tables for the same data set if you are experimenting with various structures/schemas.
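For instance, the external2 table queried below can be created over the very same directory (same data, second schema):

```sql
CREATE EXTERNAL TABLE external2 (col1 STRING)
LOCATION '/user/demo/food';
```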
Add another data file to the same directory and see how it's visible to all the tables that point to that directory:
root@mahesh:/home/mahesh/Desktop/Hadoop_Guide/sample DB/bacon# hadoop fs -put veggies.txt /user/demo/food/
SELECT * FROM internal1;
SELECT * FROM external1;
SELECT * FROM external2;
Each table will return the same results:
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456
Now drop the INTERNAL table and then look at the data from the EXTERNAL tables, which now return no rows:
DROP TABLE internal1;
SELECT * FROM external1;
hive> dfs -lsr /user/demo/food;
lsr: Cannot access /user/demo/food: No such file or directory.
Because the INTERNAL (managed) table is under Hive's control, dropping it removed the underlying data. The other tables that point to that same data still exist, but they now return no rows!
Clean up the demo tables and directory:
DROP TABLE external1;
exit;
This should give you an introductory-level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.
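As a side note, if you discover that a table was accidentally created as managed, Hive lets you flip it to external (and back) without touching the data. A sketch, using the internal1 table from this demo:

```sql
-- Mark an existing table as external so DROP TABLE no longer deletes its files.
ALTER TABLE internal1 SET TBLPROPERTIES ('EXTERNAL'='TRUE');
```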
SAMPLE FILES
Create a new text file named bacon.txt and add the following content:
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
Create another new text file named bacon2.txt and add the following content:
More_BaCoN
AndEvenMore_bAcOn
Create a third text file named veggies.txt and add the following content:
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456