Recently I encountered a problem when trying to use a large Stata file (nearly 10 GB). The file contained data for the period 1981 to 2011, but I only needed data for the period 1991 to 2009. To complicate matters, I initially didn’t even know the names of the variables in the file, a problem that can be resolved with:
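One way to do this is with Stata’s describe command, which can read a dataset’s header without loading the data into memory (the filename below is the one used later in this post):

```
* List the variables (and the observation count) stored in a
* .dta file without loading it into memory:
describe using "1980-2011.dta"
```

Because describe using reads only the file’s header, it runs quickly even on a multi-gigabyte file.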
In this case, knowing the variable names turned out to be unimportant. Instead, after a bit of trial and error, I ended up importing the observations in batches of 1 million at a time. Below is the code for the first two such batches.
*STEP 1
clear
use "1980-2011.dta" in 8000001/9000000
gen pct = round((shares / outstanding), .01)
keep if pct >= .05 & pct != .
compress
save blockholders, replace

*STEP 2
clear
use "1980-2011.dta" in 9000001/10000000
gen pct = round((shares / outstanding), .01)
keep if pct >= .05 & pct != .
compress
append using blockholders
save blockholders, replace

*STEP N
Step 1 imports a chunk of 1 million observations and keeps only those in which an investor owns 5% or more of a particular company. About 22,000 of the million observations meet this criterion, and these are saved. In Step 2, the procedure is repeated, and another ~22,000 qualifying observations are appended to the blockholders file, which is then saved again. The procedure is repeated N times until all the observations have been evaluated and only those relevant to my research project have been retained.
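The repeated steps can also be collapsed into a single loop. The sketch below is one way to do it, not the code I actually ran: it assumes the variable names from the listing above, uses describe using to learn the total observation count, and caps the final chunk at that count so the last use ... in range stays valid.

```
* Sketch: process the file in 1-million-observation chunks.
describe using "1980-2011.dta"
local N = r(N)                      // total observations in the file
local chunk = 1000000
local steps = ceil(`N' / `chunk')

forvalues i = 1/`steps' {
    local first = (`i' - 1) * `chunk' + 1
    local last  = min(`i' * `chunk', `N')   // don't overrun the file
    clear
    use "1980-2011.dta" in `first'/`last'
    gen pct = round((shares / outstanding), .01)
    keep if pct >= .05 & pct != .
    compress
    if `i' > 1 append using blockholders    // first pass creates the file
    save blockholders, replace
}
```

Looping this way avoids copy-pasting N near-identical blocks, and changing the chunk size later means editing one local macro rather than every step.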