Pivot a Hive table
Pivoting a table is a handy operation in much of our data analysis, and one can do it easily with Excel or pandas, but there is no built-in Hive function that will do the same job for us. However, there are days when one really wishes such a function existed; let me describe just one of them.
Let’s say we have a large sparse matrix with 1 million rows and 1,000 columns, and for each row there are only 10 non-zero elements. To represent this matrix as a table in its natural (wide) form, 1 billion (= 1 million x 1,000) entries are required, which is very wasteful given all those zeros. One solution is to represent the matrix in the long format, so that the table has only 3 columns: row index, column index, and value. Only the non-zero elements of the matrix are recorded in this table, so it holds only 30 million (= 1 million x 10 x 3) entries. That is substantially smaller than the previous solution, and yet we don’t lose any information.
However, there are situations where we would prefer the nice 1 million x 1,000 wide format, for example in certain machine learning scenarios. Then we need to convert the matrix from the long format to the wide format. This is where pivoting comes into play, and below is how you can do it in Hive.
Let’s make a toy Hive table to represent a sparse matrix in long format:
CREATE TABLE test_table
(
id BIGINT,
key1 STRING,
key2 STRING,
value INT
) STORED AS ORC;
INSERT INTO TABLE test_table VALUES
(1001, 'key_11', 'key_21', 111),
(1002, 'key_11', 'key_22', 222),
(1003, 'key_11', 'key_23', 333),
(1001, 'key_12', 'key_21', 111),
(1002, 'key_12', 'key_22', 222),
(1003, 'key_12', 'key_23', 333)
;
Here I spiced things up a little bit. The id column corresponds to the row index, but the column index is represented as the combination of the key1 and key2 columns. In this way, if I know for sure (e.g., by design) that there are only p unique values of key1 and q unique values of key2, then instead of spelling out all the unique column indices multiplicatively (p x q), I can just represent them additively (p + q). In the toy table, for example, p = 2 and q = 3: there are 6 composite column indices, but only 5 distinct key values to keep track of.
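To make the counting concrete, here is a tiny Python sketch that enumerates the composite column indices from the two key columns. The key value lists are just the ones from the toy table above; with real data you would plug in your own lists.
from itertools import product

# Distinct values of key1 and key2 in the toy table: p = 2 and q = 3.
key1_values = ["key_11", "key_12"]
key2_values = ["key_21", "key_22", "key_23"]

# Only p + q = 5 key values need to be known, but together they span
# p * q = 6 composite column indices for the wide format.
composite_columns = [f"{k1}_{k2}" for k1, k2 in product(key1_values, key2_values)]
print(len(key1_values) + len(key2_values))  # 5
print(composite_columns)                    # 6 names, starting with 'key_11_key_21'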
Method 1
The first method is rather straightforward: since we know, a priori, all the column indices, we can just scan the long table row by row and check, for each row index (id in the toy table), which column index it corresponds to. The query is then just an exhaustive set of CASE expressions, such as the one below:
SELECT
id,
MAX(CASE WHEN key1 = 'key_11' AND key2 = 'key_21' THEN value ELSE NULL END) AS key_11_key_21,
MAX(CASE WHEN key1 = 'key_11' AND key2 = 'key_22' THEN value ELSE NULL END) AS key_11_key_22,
MAX(CASE WHEN key1 = 'key_11' AND key2 = 'key_23' THEN value ELSE NULL END) AS key_11_key_23,
MAX(CASE WHEN key1 = 'key_12' AND key2 = 'key_21' THEN value ELSE NULL END) AS key_12_key_21,
MAX(CASE WHEN key1 = 'key_12' AND key2 = 'key_22' THEN value ELSE NULL END) AS key_12_key_22,
MAX(CASE WHEN key1 = 'key_12' AND key2 = 'key_23' THEN value ELSE NULL END) AS key_12_key_23
FROM
test_table
GROUP BY
1
;
It will do the job just fine: the MAX function is there as a trick to satisfy the GROUP BY clause, since we assume there is only one value for each (row index, column index) pair anyway. All those CASE expressions might seem daunting, but they can easily be generated by nested for loops in a scripting language such as Python (see the sketch after the result table). Below is the result, exactly what we wanted:
id key_11_key_21 key_11_key_22 key_11_key_23 key_12_key_21 key_12_key_22 key_12_key_23
-----+---------------+---------------+---------------+---------------+---------------+--------------
1001 111 NULL NULL 111 NULL NULL
1002 NULL 222 NULL NULL 222 NULL
1003 NULL NULL 333 NULL NULL 333
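As promised, the CASE expressions do not have to be typed by hand. Below is a minimal Python sketch of such a generator, assuming the same toy key values and the test_table name from above; for a real table you would substitute the full key lists.
from itertools import product

# The same hypothetical key value lists as in the earlier sketch;
# for a real table, list (or query) all p and q distinct values here.
key1_values = ["key_11", "key_12"]
key2_values = ["key_21", "key_22", "key_23"]

# One MAX(CASE ...) expression per composite column index (p * q in total).
case_lines = [
    f"  MAX(CASE WHEN key1 = '{k1}' AND key2 = '{k2}' THEN value ELSE NULL END) AS {k1}_{k2}"
    for k1, k2 in product(key1_values, key2_values)
]

query = (
    "SELECT\n  id,\n"
    + ",\n".join(case_lines)
    + "\nFROM\n  test_table\nGROUP BY\n  1\n;"
)
print(query)
For the toy table this prints a query equivalent to the Method 1 query above; for the real 100-column case, only the two lists get longer.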
However, when I used this method on a real table with more than 3 billion rows, it took a whopping 5+ hours to produce the wide-format table, which has about 50 million rows and 100 columns. I needed a real pivot.
Method 2
Let me just show the query first:
SELECT
mapped.id,
COLLECT_LIST(group_map["key_11_key_21"])[0] AS key_11_key_21,
COLLECT_LIST(group_map["key_11_key_22"])[0] AS key_11_key_22,
COLLECT_LIST(group_map["key_11_key_23"])[0] AS key_11_key_23,
COLLECT_LIST(group_map["key_12_key_21"])[0] AS key_12_key_21,
COLLECT_LIST(group_map["key_12_key_22"])[0] AS key_12_key_22,
COLLECT_LIST(group_map["key_12_key_23"])[0] AS key_12_key_23
FROM
(
SELECT
id,
MAP(CONCAT_WS('_', key1, key2), value) AS group_map
FROM
test_table
) mapped
GROUP BY mapped.id
;
If you trust me, this will return the same results.
What is going on here? We first make a sub-query, whose result is given the alias mapped. In this sub-query, we use a simple MAP function (this is the first trick), together with the CONCAT_WS function that stitches the two keys together into the column index. Below is what mapped looks like:
id group_map
-----+----------------------
1001 {"key_11_key_21":111}
1002 {"key_11_key_22":222}
1003 {"key_11_key_23":333}
1001 {"key_12_key_21":111}
1002 {"key_12_key_22":222}
1003 {"key_12_key_23":333}
Essentially, we have made a dictionary from each row, with just one key-value pair. Then in the outer query, as the second trick, we invoke the COLLECT_LIST function to pick the values out of these dictionaries while grouping by id (the row index). Again, here we assume:
- There is only one value for each (row index, column index) pair.
- We can exhaustively enumerate all the column indices.
We then just pick the first element from the COLLECT_LIST result, instead of applying any other reduction; this relies on the first assumption listed above. If we drop the [0] part, below is the result:
id key_11_key_21 key_11_key_22 key_11_key_23 key_12_key_21 key_12_key_22 key_12_key_23
-----+---------------+---------------+---------------+---------------+---------------+--------------
1001 [111] [] [] [111] [] []
1002 [] [222] [] [] [222] []
1003 [] [] [333] [] [] [333]
Each element is now a list with one or zero entries. If it turned out that there were more than one entry, i.e., multiple values for the same (row index, column index) pair, then one could do further aggregation, using another outer query with another GROUP BY.
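The same kind of generation works for Method 2. Below is a minimal Python sketch along the same lines, again assuming the toy key values and the test_table name; dropping the trailing [0] in the template would give the list-valued result shown above.
from itertools import product

# Hypothetical key value lists, as in the Method 1 generator sketch.
key1_values = ["key_11", "key_12"]
key2_values = ["key_21", "key_22", "key_23"]

# One COLLECT_LIST(...)[0] expression per composite column index.
collect_lines = [
    f'  COLLECT_LIST(group_map["{k1}_{k2}"])[0] AS {k1}_{k2}'
    for k1, k2 in product(key1_values, key2_values)
]

query = (
    "SELECT\n  mapped.id,\n"
    + ",\n".join(collect_lines)
    + "\nFROM\n(\n  SELECT\n    id,\n"
    + "    MAP(CONCAT_WS('_', key1, key2), value) AS group_map\n"
    + "  FROM\n    test_table\n) mapped\nGROUP BY mapped.id\n;"
)
print(query)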
How does it work in practice? Using this method, the same 3-billion-plus-row task took less than 40 minutes! The run time does depend on the actual Hadoop cluster load, which can fluctuate, but given such a large difference, I can comfortably reject the null hypothesis and happily adopt Method 2.