Sunday, May 10, 2009

Binary mode v.s. Text mode

1.1 Newline
a newline, also known as a line break or end-of-line (EOL) character, is a special character or sequence of characters signifying the end of a line of text.
Software applications and operating systems usually represent a newline with one or two control characters:

Systems based on ASCII or a compatible character set use either LF (Line feed, 0x0A) or CR (Carriage Return, 0x0D) individually, or CR followed by LF (CR+LF, 0x0D 0x0A); These characters are based on printer commands: The line feed indicated that one line of paper should feed out of the printer, and a carriage return indicated that the printer carriage should return to the beginning of the current line.

The C programming language provides the escape sequences '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters. The C standard only guarantees two things:

  1. Each of these escape sequences maps to a unique implementation-defined number that can be stored in a single char value.
  2. When writing a file in text mode, '\n' is transparently translated to the native newline sequence used by the system, which may be longer than one character. When reading in text mode, the native newline sequence is translated back to '\n'. In binary mode, the second mode of I/O supported by the C library, no translation is performed, and the internal representation of any escape sequence is output directly.
On Unix platforms, where C originated, the native newline sequence is ASCII LF (0x0A), so '\n' was simply defined to be that value. With the internal and external representation being identical, the translation performed in text mode effectively turns into a no-op, making text mode and binary mode behave the same. This has caused many programmers who developed their software on Unix systems simply to ignore the distinction completely, resulting in code that is not portable to different platforms.

1.2

Common problems

The different newline conventions often cause text files that have been transferred between systems of different types to be displayed incorrectly. For example, files originating on Unix or Apple Macintosh systems may appear as a single long line on a Windows system. Conversely, when viewing a file from a Windows computer on a Unix system, the extra CR may be displayed as ^M at the end of each line or as a second line break.

The problem can be hard to spot if some programs handle the foreign newlines properly while others don't. For example, a compiler may fail with obscure syntax errors even though the source file looks correct when displayed on the console or in an editor. On a Unix system, the command cat -v myfile.txt will send the file to stdout (normally the terminal) and make the ^M visible, which can be useful for debugging. Modern text editors generally recognize all flavours of CR / LF newlines and allow the user to convert between the different standards. Web browsers are usually also capable of displaying text files of different types.


1.3 Open modes in C++


You can control the way a file is opened by overriding the constructor’s default arguments. The following table shows the flags that control the mode of the file:

ios::in
Opens an input file. Use this as an open mode for an ofstream to prevent truncating an existing file.

ios::out
Opens an output file. When used for an ofstream without ios::app, ios::ate or ios::in, ios::trunc is implied.

ios::binary
Opens a file in binary mode. The default is text mode.

You can combine these flags using a bitwise or operation. The binary flag, while portable, only has an effect on some non-UNIX systems, such as operating systems derived from MS-DOS, that have special conventions for storing end-of-line delimiters. For example, on MS-DOS systems in text mode (which is the default), every time you output a newline character ('\n'), the file system actually outputs two characters, a carriage-return/linefeed pair (CRLF), which is the pair of ASCII characters 0x0D and 0x0A. Conversely, when you read such a file back into memory in text mode, each occurrence of this pair of bytes causes a '\n' to be sent to the program in its place. If you want to bypass this special processing, you open files in binary mode. Binary mode has nothing whatsoever to do with whether you can write raw bytes to a file—you always can (by calling write( )) . You should, however, open a file in binary mode when you’ll be using read( ) or write( ), because these functions take a byte count parameter. Having the extra '\r' characters will throw your byte count off in those instances. You should also open a file in binary mode if you’re going to use the stream-positioning commands discussed later in this chapter.

1.4 Conclusion

The representation of text files varies among operating systems. For example, the end of a line in a UNIX environment is represented by the linefeed character '\n'. On some other systems, such as Microsoft Windows, the end of the line consists of two characters, carriage return '\r' and linefeed '\n'. The end of the file differs as well on these two operating systems. Peculiarities on other operating systems are also conceivable.


To make programs more portable among operating systems, an automatic conversion can be done on input and output. The carriage return or linefeed sequence, for example, can be converted to a single '\n' character on input; the '\n' can be expanded to "\r\n" on output. This conversion mode is called text mode, as opposed to binary mode. In binary mode, no such conversions are performed.

The mode flag std::ios_base::binary has the effect of opening a file in binary mode. This has the effect described above; in other words, all automatic conversions, such as converting "\r\n" to '\n', are suppressed. [Basically, the binary mode flag is passed on to the respective operating system's service function, which means that in principle all system-specific conversions are suppressed, not only the carriage return / linefeed handling.]


If you must process a binary file, you should always set the binary mode flag, because most likely you do not want any kind of implicit, system-specific conversion performed.
The effect of the binary open mode is frequently misunderstood. It does not put the inserters and extractors into a binary mode, and hence suppress the formatting they usually perform. Binary input and output is done solely by basic_istream<>::read() and basic_ostream<>::write().

No comments: