File Formats

Created: Jan 28, 2024 • Modified: Jan 28, 2024

analytics

RCF and ORC

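RCF (RCFile, short for Record Columnar File) is the older format and is rarely used in new systems
ORC (Optimized Row Columnar) is its successor and generally achieves the highest compression of the formats listed here
RCFile originated at Facebook, and ORC was developed for Hive by Hortonworks together with Facebook. Both store data in columnar form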
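
To make "columnar form" concrete, here is a toy sketch in plain Python. The records and field names are made up for illustration, and real RCFile/ORC files additionally split data into row groups or stripes and compress each column, none of which is shown here.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# The records and field names are hypothetical.
rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "IN", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads only that column's values,
# instead of walking every field of every row.
total_amount = sum(columns["amount"])
```

Because values of the same field sit next to each other, they also tend to compress better than interleaved row data.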

Parquet

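Parquet was created by Twitter and Cloudera and is the most widely used of these formats in industry. Decompression is often reported to be faster than with ORC, though this depends on the codec and workload. Spark uses Parquet as its default data source format
Columnar formats such as Parquet and ORC waste little space. NULLs take no slot in the stored values, and repeated values are compacted with techniques such as dictionary and run-length encoding
They are faster to read than text files, particularly when a query needs only a few columns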
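
The NULL and duplicate handling can be sketched roughly as follows. This is a simplified illustration, not the actual on-disk encoding of Parquet or ORC (which use more elaborate schemes); the function names and sample column are assumptions made for the example.

```python
def encode_column(values):
    """Encode a column as (validity flags, dictionary, run-length pairs).

    None entries are recorded only in the validity flags; they take no
    slot in the encoded data.
    """
    validity = [v is not None for v in values]
    present = [v for v in values if v is not None]

    dictionary = []   # distinct values in first-seen order
    index_of = {}     # value -> position in dictionary
    codes = []
    for v in present:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])

    runs = []         # [code, run_length] pairs
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return validity, dictionary, runs


def decode_column(validity, dictionary, runs):
    """Rebuild the original column from its encoded form."""
    present = (dictionary[code] for code, n in runs for _ in range(n))
    return [next(present) if ok else None for ok in validity]


column = ["IN", "IN", "IN", None, "US", "US", None, "IN"]
validity, dictionary, runs = encode_column(column)
# dictionary == ["IN", "US"]; runs == [[0, 3], [1, 2], [0, 1]]
```

Eight entries shrink to a two-item dictionary and three runs, and the two NULLs cost only a flag each.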

Avro

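Avro is a row-based data serialization system that originated in the Hadoop project. The format is language independent
.avsc files hold the Avro schema (written in JSON) and .avro files hold the data
Records are made up of named fields, much like JSON objects, but the data itself is stored in a compact binary encoding alongside its schema
Avro also defines an RPC protocol for moving data between services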
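
As a rough sketch of how a JSON schema drives Avro's binary encoding, the snippet below hand-encodes one record using only the standard library. The `User` schema and the helper names are hypothetical, only the `string`, `int` and `long` types are handled, and the .avro container format (file header, embedded schema, sync markers) is not covered. In practice an Avro library would do this work.

```python
import json

# Hypothetical schema, as it might appear in a user.avsc file.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
""")


def encode_long(n):
    """Zig-zag, then variable-length encode an integer (Avro int/long)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"string": encode_string, "int": encode_long, "long": encode_long}


def encode_record(schema, record):
    """Concatenate field encodings in schema order; field names are not written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


payload = encode_record(SCHEMA, {"name": "Asha", "age": 30})
# payload == b"\x08Asha<"  (6 bytes: length, "Asha", zig-zag 30)
```

Since field names live only in the schema, a reader needs that schema to interpret the bytes, which is why Avro data files carry it with them.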

Sequence File

Sequence files are flat Hadoop files that store data as binary key-value pairs
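
The idea of a flat file of binary key-value pairs can be illustrated with a minimal length-prefixed layout. This is a sketch of the concept only: the record layout and function names are assumptions, and it is not the real Hadoop SequenceFile format, which also has a file header, sync markers and optional compression.

```python
import io
import struct


def write_pairs(stream, pairs):
    """Write each (key, value) bytes pair as: key length, value length, key, value."""
    for key, value in pairs:
        stream.write(struct.pack(">II", len(key), len(value)))
        stream.write(key)
        stream.write(value)


def read_pairs(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)


# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
write_pairs(buffer, [(b"file-001", b"\x00\x01\x02"), (b"file-002", b"hello")])
buffer.seek(0)
pairs = list(read_pairs(buffer))
```

A common use of this kind of layout is packing many small files into one large file, with the file name as the key and the contents as the value.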

Hadoop File Formats, when and what to use?

New in Hadoop: Various File Format in Hadoop | Towards Data Science

© 2025 David Varghese. Licensed CC BY 4.0. Quartz v4.5.0
