EViews workfile format

Allin Cottrell <cottrell at wfu.edu>
Wake Forest University, July 2005
Updated, July 2011

Introduction

EViews is a popular proprietary econometrics program. It is widely used in teaching, and in various places around the Internet one can find datasets made available by publishers of textbooks or by professors in the EViews workfile format. It struck me that it would be useful if gretl could read this format. There does not appear to be any publicly available specification (not surprising for a proprietary binary format), so I decided to try reverse-engineering. This document sets out my findings. The findings are based on examination of several workfiles from different sources and dates (using Emacs in hexl mode, the strings program, and an exploratory reader program written in C), but I have no idea how general they are. I welcome any corrections or additions.

Update: The following discussion pertains to "New MicroTSP Workfile," the "traditional" EViews data format. With EViews 7, however, the program seems to have migrated to a new format, labeled "EViews File V01". This format resembles the old one in some respects but appears to be considerably more complex. Some notes on the new version appear in the appendix to this document. Note that the wf1 extension is used for both types of file.

Overview of format

An EViews workfile starts with an identifying string, "New MicroTSP Workfile," which seems to be padded out to 24 bytes with NUL characters. This is followed by a header of variable size, but within which certain key information seems to occur at fixed offsets. Then comes a series of 70-byte records, containing information on the data series in the file (and possibly information on other objects in some cases?). The central section of the file contains blocks of actual data, stored as doubles, and other information on the variables. The stream positions of these blocks are given in the preceding 70-byte records. All numbers seem to be stored in little-endian byte order. The examples I have seen also have a substantial swathe of NUL bytes in the central section. The file ends with a trailer section that includes the name of the file and strings representing the starting and ending observations.

Header section

As mentioned above, the header is of variable size. I'm unsure of exactly where the header ends and the series of 70-byte records begins, so I don't know the exact size of the header in any instance, but a common size seems to be 144 or 146 bytes (excluding the leading 24 bytes). In some files I've looked at the header is 32 bytes larger than this. The fields within the header that appear to be fixed are shown below (byte offsets are decimal and relative to the start of the file; lengths are in bytes). As you'll see, there's a lot here that doesn't yet make sense to me.

offset  length    comment

0       80        ???
80      8         long: size of header
88      26        ???
114     4         int: number of variables + 1
118     4         time_t: date of last modification of the file (or zero)
122     2         ???
124     2         short: data frequency (e.g. 1 for annual, 4 for quarterly)
126     2         short: starting subperiod (for, e.g., quarterly or monthly data) or zero
128     4         int: starting observation (e.g. year)
132     8         ??? (see below for update)
140     4         int: number of observations
144     variable  ??? (mostly NULs)

The long at offset 80 gives a number that is closely related to the stream position of the start of the series of 70-byte variable records. For example, in some files the value is 144. Add 24 bytes for the initial identifier and you get 168. In such files, I've been able to start reading 70-byte records at byte 168. In other files, the value at offset 80 is 176, and one can read 70-byte records from byte 200 (= 176 + 24). But the alignment of those records looks a little funny and it seems "cleaner" if you start 2 bytes later (padding?).

The int at offset 114 seems to give the number of variables in the file plus one. Perhaps this number counts some other object that I haven't attempted to parse.

Update: it appears that to find the starting subperiod for quarterly or monthly data you should read a short at position 132 in the header, not at 126 as shown in the table above.
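As a concrete sketch of the layout above, here is how the fixed header fields might be read in Python. The function and field names are my own; the offsets and types are the assumptions tabulated above.

```python
import struct

def read_header(data):
    """Parse the fixed header fields of an old-style EViews workfile.

    'data' is the raw bytes of the whole file; offsets are relative to
    the start of the file, as in the table above.
    """
    if not data.startswith(b"New MicroTSP Workfile"):
        raise ValueError("not an old-style EViews workfile")
    # '<' = little-endian; 'q' = 8-byte int, 'i' = 4-byte int, 'h' = 2-byte int
    (hsize,) = struct.unpack_from("<q", data, 80)
    (nplus1,) = struct.unpack_from("<i", data, 114)
    (mtime,) = struct.unpack_from("<i", data, 118)   # time_t, or zero
    freq, subper = struct.unpack_from("<hh", data, 124)
    (startobs,) = struct.unpack_from("<i", data, 128)
    (nobs,) = struct.unpack_from("<i", data, 140)
    return {
        "header_size": hsize,
        "nvars": nplus1 - 1,   # header value seems to count one extra object
        "mtime": mtime,
        "freq": freq,
        "subperiod": subper,
        "startobs": startobs,
        "nobs": nobs,
    }
```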

70-byte records for variables

These records contain material, at both the beginning and the end, that I haven't been able to parse, so I'm unsure of their precise alignment. The first clearly identifiable element is an int that gives the size of the further record containing the data for the variable in question. The table below is based on the assumption that this element is located at a byte offset of 6 into the record. It's possible that the record starts two bytes earlier, in which case the offsets below would have to be augmented by 2.

offset  length  comment

0       6       ???
6       4       int: size of data record (pointed to by the value at offset 14)
10      4       int: size of actual data block
14      8       long: stream position for further information plus actual data values
22      32      string: the name of the variable, padded to the right with NULs
54      8       long: stream position for "history" information, or zero if there's no history info
62      2       short: code representing the nature of the object?
64      6       ???

The unknown material at the start and end of the record adds up to 12 "mystery bytes." Typically, these bytes seem to be identical for all the regular variables in a given workfile.

The "code" at offset 62, read as a short int, seems to be 44 for regular variables, and 43 for the special variable "C" (the constant). There may be other codes too. Following the pointer at offset 54, if it is non-zero, leads you to information on the "revision history" of the variable, and following the pointer at offset 14 leads you to an actual data block. These blocks are described in the following sections.

Note that each EViews workfile seems to contain two "boilerplate" variables: the constant "C" and a residuals series "RESID". If no equation has been estimated, "RESID" just contains missing values.
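Under the alignment assumption noted above, a 70-byte record can be unpacked as follows (a sketch; the field names are mine):

```python
import struct

def read_var_record(rec):
    """Parse one 70-byte variable record, using the offsets in the
    table above (assuming the size-of-data-record int sits at offset 6).
    """
    assert len(rec) == 70
    (recsize,) = struct.unpack_from("<i", rec, 6)    # size of data record
    (datasize,) = struct.unpack_from("<i", rec, 10)  # size of data block
    (datapos,) = struct.unpack_from("<q", rec, 14)   # stream position of data
    # 32-byte name, NUL-padded on the right
    name = rec[22:54].split(b"\x00", 1)[0].decode("ascii", "replace")
    (histpos,) = struct.unpack_from("<q", rec, 54)   # history pointer, or 0
    (code,) = struct.unpack_from("<h", rec, 62)      # 44 = regular variable?
    return {"name": name, "recsize": recsize, "datasize": datasize,
            "datapos": datapos, "histpos": histpos, "code": code}
```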

Revision history block

offset  length  comment

0       2       ???
2       4       int: length of revision string
6       4       int: another length?
10      8       long: stream position of revision string

If you follow up at the stream position given in the last element, you find a string giving info on how the variable in question has been redefined. The length of this string is variable, and is apparently given by the second element above. In examples I've seen, the int at offset 6 has the same value as the one at offset 2. Perhaps it's a length that is inclusive of something else unknown, that happens to be zero in the cases I've looked at?
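Assuming the layout above, following a non-zero history pointer might look like this (a sketch; the function name is my own):

```python
import struct

def read_history(data, pos):
    """Return the revision string for a variable, given the raw file
    bytes and the stream position of its revision-history block.
    Offsets within the block are as in the table above.
    """
    (slen,) = struct.unpack_from("<i", data, pos + 2)   # string length
    (spos,) = struct.unpack_from("<q", data, pos + 10)  # string position
    return data[spos:spos + slen].decode("ascii", "replace")
```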

Data block

offset  length    comment

0       4         int: number of observations
4       4         int: starting observation
8       8         ??? (NULs in cases I've seen)
16      4         int: ending observation
20      2         ??? (NULs in cases I've seen)
22      variable  doubles: data values

From byte 22 onward, it seems one can find a number of doubles given by the first int field of the data block. Missing values ("NA") are represented by 1e-37.
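Putting the table together, a data block could be read as follows (a sketch under the assumptions above; here the 1e-37 missing-value convention is mapped to None):

```python
import struct

NA = 1e-37  # EViews missing-value marker in this format

def read_data_block(data, pos):
    """Read one data block at stream position 'pos': the header fields
    per the table above, then the doubles from offset 22 onward.
    """
    (nobs,) = struct.unpack_from("<i", data, pos)
    (start,) = struct.unpack_from("<i", data, pos + 4)
    (end,) = struct.unpack_from("<i", data, pos + 16)
    vals = struct.unpack_from("<%dd" % nobs, data, pos + 22)
    vals = [None if v == NA else v for v in vals]
    return {"nobs": nobs, "start": start, "end": end, "values": vals}
```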

Algorithm for reading the files

On the basis of the foregoing, here is my algorithm for reading data series out of an EViews workfile.

  1. Check that the file begins with the string "New MicroTSP Workfile," and reject it if it does not.

  2. Read the basic dataset information from the header, using the offsets given in the first table above (number of variables, number of observations, frequency, starting observation, and so on). Let hlen denote the long int value read at offset 80, and let n denote the int value read at offset 114.

  3. Advance the read position to hlen + 26, and start reading (n-1) 70-byte records. At each record, check that the "code" (short at offset 62) equals 44. If it's 43, skip forward 70 bytes, ignoring the constant. Else if it's not 44, either try skipping 70 bytes or abort: you've found something that is unknown on the basis of my investigations. Also check the 32-byte variable name at offset 22: if it's "RESID," skip 70 bytes forward because there's nothing of interest there.

  4. If you got a "code 44" and the name is not "RESID," go to the stream position given by the long int at offset 14 in the variable record. Read the number of observations given by the int at offset 0 in this block. If this does not equal the global number of observations (call it T) given in the header, I'm not sure what to do: in all the examples I've looked at the values have been equal. Now go to offset 22 and try reading T doubles. Watch out for values of 1e-37 (missing) -- and also for NaN (not a number), which would indicate that you've somehow got off track.

  5. After each successful read of a variable record, advance your basic reading position by 70 bytes.

  6. When you've checked out n-1 records, stop.
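The steps above can be sketched as a single routine (a sketch, not a production reader; it assumes the "hlen + 26" alignment and the offsets tabulated earlier, and simply raises an error on the observation-count mismatch mentioned in step 4):

```python
import struct

NA = 1e-37  # EViews missing-value marker

def read_workfile(data):
    """Return {name: values} for the regular series in an old-style
    EViews workfile, skipping the constant C and RESID.
    """
    # Step 1: check the identifying string
    if not data.startswith(b"New MicroTSP Workfile"):
        raise ValueError("not an old-style EViews workfile")
    # Step 2: header fields
    (hlen,) = struct.unpack_from("<q", data, 80)   # header size
    (n,) = struct.unpack_from("<i", data, 114)     # number of variables + 1
    (T,) = struct.unpack_from("<i", data, 140)     # number of observations
    # Step 3: walk the (n - 1) 70-byte records
    series = {}
    pos = hlen + 26
    for _ in range(n - 1):
        rec = data[pos:pos + 70]
        (code,) = struct.unpack_from("<h", rec, 62)
        name = rec[22:54].split(b"\x00", 1)[0].decode("ascii", "replace")
        # Step 4: read the data block for code-44 variables other than RESID;
        # anything else (e.g. code 43, the constant) is skipped over
        if code == 44 and name != "RESID":
            (dpos,) = struct.unpack_from("<q", rec, 14)
            (nobs,) = struct.unpack_from("<i", data, dpos)
            if nobs != T:
                raise ValueError("observation count mismatch for %s" % name)
            vals = struct.unpack_from("<%dd" % nobs, data, dpos + 22)
            series[name] = [None if v == NA else v for v in vals]
        # Steps 5-6: advance 70 bytes and continue
        pos += 70
    return series
```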

And that's it for now.


Appendix: EViews File V01

Here are some remarks on the format used in EViews 7, identified by the first 15 bytes, "EViews File V01". The basic block structure described above carries over: I'm able to read the names of variables and their modification history as before. But extracting the data values is more problematic.

In the old format the values of each variable could be found as contiguous little-endian "doubles", located at a readily predictable stream position; in the new format this (which I'll call "standard mode") seems to be just one of at least three possible modes of representing the data. I have found a couple of checks to determine whether standard mode is used. First, one can read a short int at offset 4 into the 70-byte variable record: it appears that this should have value 11 for standard mode (in non-standard cases it often has value 60). A further check involves the data record size. The integer at offset 10 into a variable's 70-byte record gives the size of the actual data block: in standard mode this should be 8 times the number of observations (since a double takes up 8 bytes); in other modes the size falls short.
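The two checks just described might be combined as follows (a sketch; nothing here is confirmed beyond the observations above, and the value 11 is purely empirical):

```python
import struct

def is_standard_mode(rec, nobs):
    """Heuristic: does a variable in an 'EViews File V01' workfile store
    its data in 'standard mode' (contiguous doubles)?  Per the notes
    above, the short at offset 4 of the 70-byte record should be 11,
    and the data-block size at offset 10 should be 8 bytes per
    observation.
    """
    (mode,) = struct.unpack_from("<h", rec, 4)
    (datasize,) = struct.unpack_from("<i", rec, 10)
    return mode == 11 and datasize == 8 * nobs
```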

A second mode of data representation in EViews File V01 involves a table of unique values for each variable. Presumably this must go along with sets of indices into these tables: something like, "for observation 1, use value 1; for observation 2, use value 2; for observation 3, use value 1 again" and so on. This tactic is reminiscent of the MS Excel binary format, as documented by OpenOffice.org. In the files I've examined such tables of unique values, when present, are located at an offset of either 40 or 140 bytes beyond the reading position that applies for full-length series (see above). But that's about all I've worked out at present.

I infer there's a third way of storing data but I don't know what it is. With some files of this sort you can't find the data values stored as standard doubles, either in full or in unique-values form (and neither can you find them as single-precision "floats"). It's possible some wacky floating-point format is used (Microsoft's?), or compression may be employed.

I'd be interested to hear if anyone is able to make sense of the new format, but I don't think it is going to be easy. Of course, it would be somewhat easier if you have a copy of EViews 7 to play with; I don't.