Big Data Hadoop question



My question:
Ideally, what should the block size be in a Hadoop cluster?


Typically the block size is 128 MB (the default). You can change it via the dfs.blocksize property in hdfs-site.xml; it is an HDFS setting, not a YARN one.
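As a sketch, a cluster-wide block size of 256 MB could be configured like this in hdfs-site.xml (the value is in bytes; 268435456 = 256 × 1024 × 1024):

```xml
<!-- hdfs-site.xml: raise the default HDFS block size to 256 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```

Note this only affects files written after the change; existing files keep the block size they were written with, and a client can also override it per file at write time.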


The total time to read data from a disk consists of "seek time", which is the time to locate the start of the data, and "transfer time", which is the time it takes to read contiguous blocks of data. When the system is dealing with hundreds of terabytes or petabytes of data, the time it takes to read from disk matters. There isn't much that can be done to reduce seek time. However, if the block size is large, a significant amount of data can be read per seek, so seek time is amortized over more data. This doesn't mean that the larger the block size, the better.
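The seek-versus-transfer trade-off can be sketched with a back-of-the-envelope model. The seek time and transfer rate below are assumed, illustrative figures (roughly typical for a spinning disk), not measurements:

```python
SEEK_TIME_S = 0.010     # assumed: ~10 ms average seek per block
TRANSFER_MB_S = 100.0   # assumed: ~100 MB/s sequential transfer rate

def read_time_s(file_mb: float, block_mb: float) -> float:
    """Total time to read a file if each block costs one seek plus its transfer."""
    seeks = file_mb / block_mb                      # one seek per block
    return seeks * SEEK_TIME_S + file_mb / TRANSFER_MB_S

# Reading 1 TB with different block sizes:
for block_mb in (1, 64, 128, 1024):
    t = read_time_s(1_048_576, block_mb)
    print(f"{block_mb:>5} MB blocks: {t:>9,.0f} s")
```

With 1 MB blocks, seek time roughly doubles the total read time; by 128 MB blocks, seeking is already well under 1% of the total, which is why going still larger buys little on the I/O side.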
Each block is processed by one mapper. So if there are fewer blocks than nodes, some nodes in the cluster sit idle, and the job loses parallelism. One therefore needs to strike a balance: 128 MB has been found to work well in practice, but some applications may need a larger or smaller block size.
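The parallelism side of the balance can be illustrated with a small sketch (the file size and cluster size are made-up numbers; it assumes the default one-split-per-block behavior):

```python
import math

def num_map_tasks(file_mb: int, block_mb: int) -> int:
    """One map task per block (input split), as with the default input format."""
    return math.ceil(file_mb / block_mb)

# A hypothetical 10 GB file on a hypothetical 100-node cluster:
file_mb, nodes = 10 * 1024, 100
for block_mb in (128, 512, 2048):
    tasks = num_map_tasks(file_mb, block_mb)
    busy = min(tasks, nodes)
    print(f"{block_mb:>4} MB blocks -> {tasks:>3} map tasks, {busy}/{nodes} nodes busy")
```

At 128 MB the file yields 80 map tasks, so 80 of the 100 nodes can work in parallel; at 2048 MB only 5 tasks exist and 95 nodes are idle, which is the cost of oversized blocks.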