Session 3: Basic Stata Operations and Data Manipulation

Script Walkthrough


* -----------------------------------------------------
* Basic Stata Commands and Knowledge for New Users
* -----------------------------------------------------

* Clear the current environment (important when starting fresh)
clear all

* Stop Stata from pausing when output is long (e.g., large tables)
set more off

* -----------------------------------------------------
* Loading Built-in Data
* -----------------------------------------------------

* Load a built-in dataset provided by Stata (e.g., auto dataset)
sysuse auto

* Check the first few rows of the dataset
list in 1/10

* Show the structure of the dataset, including variable names, storage types, and labels
describe

* Show detailed information for each variable: type, range, number of unique values, and missing values
codebook
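
* Optional sketch (an addition to the original script): when only the size of
* the dataset is needed, -describe, short- reports the number of observations
* and variables, and -count- stores the number of observations in r(N).
describe, short
count
display "Number of observations: " r(N)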

* -----------------------------------------------------
* Understanding Variables and Data Structure
* -----------------------------------------------------

* Get a summary of all variables in the dataset (mean, std dev, min, max)
summarize

* Get summary statistics for specific variables (price, weight, mpg)
summarize price weight mpg
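
* Optional sketch: the detail option adds percentiles, skewness, and kurtosis,
* and -summarize- leaves its results in r() so they can be reused afterwards.
summarize price, detail
display "Mean price: " r(mean) "  Median price: " r(p50)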

* -----------------------------------------------------
* Variable Labels and Value Labels
* -----------------------------------------------------

* Display all value labels defined in the dataset (the mappings from numeric codes to text)
label list

* Display a single value label by name (origin is the value label attached to foreign)
label list origin

* List the unique values and their frequency for a categorical variable (foreign)
tabulate foreign
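
* Optional sketch: foreign is stored as numbers (0/1) and displayed through the
* value label origin. The nolabel option shows the underlying codes, and the
* missing option adds a row for missing values (rep78 has some in this dataset).
tabulate foreign, nolabel
tabulate rep78, missing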

* -----------------------------------------------------
* Data Inspection: Simple and Structured Data Views
* -----------------------------------------------------

* List specific variables for the first 10 observations (make, price, mpg)
list make price mpg in 1/10

* Find summary statistics broken down by a category (e.g., summarize by foreign)
bysort foreign: summarize price mpg
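
* Optional sketch: -tabstat- gives the same kind of breakdown in one compact
* table. The statistics chosen here (mean, sd, n) are only an example.
tabstat price mpg, by(foreign) statistics(mean sd n)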

* -----------------------------------------------------
* Sorting and Organizing Data
* -----------------------------------------------------

* Sort the data by the price variable in ascending order
sort price

* Display the first 10 observations sorted by price
list make price mpg in 1/10

* Sort the data by price in descending order (using gsort)
gsort -price

* Display the first 10 observations after sorting in descending order
list make price mpg in 1/10
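
* Optional sketch: -gsort- accepts several keys, each ascending (+, the default)
* or descending (-). Here: by origin, then from most to least expensive within
* each origin. preserve/restore puts back the price-descending order afterwards.
preserve
gsort foreign -price
list make foreign price in 1/5
restore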

* -----------------------------------------------------
* Basic Graphs and Data Visualization
* -----------------------------------------------------

* Create a scatter plot of price vs. weight
scatter price weight

* Create a histogram of mpg (miles per gallon)
histogram mpg
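
* Optional sketch: the same graphs with titles and axis labels. The title text
* and the bin width are illustrative choices, not part of the original script.
scatter price weight, title("Price vs. weight") ytitle("Price (USD)") xtitle("Weight (lbs.)")
histogram mpg, frequency width(5) title("Distribution of mpg")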

* -----------------------------------------------------
* Saving and Exporting Data
* -----------------------------------------------------

* Save the dataset in Stata format with a new name
save auto_copy.dta, replace

* Export the dataset to a CSV file
export delimited using "auto_data.csv", replace
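
* Optional sketch: read the CSV back in to check the export. preserve/restore
* puts the original data back in memory afterwards. Note that value-labeled
* variables such as foreign are exported as their label text by default.
preserve
import delimited using "auto_data.csv", clear
describe, short
list make price foreign in 1/5
restore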

* -----------------------------------------------------
* Basic File Management
* -----------------------------------------------------

* Check the current working directory
pwd

* Change the working directory (use your desired folder path)
cd "C:/Your/Desired/Directory"

* List all the files in the current working directory
dir

* -----------------------------------------------------
* Help and Resources
* -----------------------------------------------------

* Get detailed help on any Stata command (e.g., summarize)
help summarize

* Search for commands or functions related to specific tasks (e.g., regression)
search regression

* -----------------------------------------------------
* Conclusion of Basic Stata Commands
* -----------------------------------------------------

* This session covered essential Stata operations:
* - Loading data (`sysuse`)
* - Inspecting datasets (`describe`, `codebook`, `summarize`)
* - Sorting data (`sort`, `gsort`)
* - Simple graphs (`scatter`, `histogram`)
* - Managing files and directories (`save`, `export delimited`, `pwd`, `cd`, `dir`)
* - Using Stata's help system (`help`, `search`)
* -----------------------------------------------------






* -----------------------------------------------------
* Introduction to Data Manipulation in Stata
* -----------------------------------------------------

clear all
set more off

* Load the built-in dataset
sysuse auto

* Inspect the data
list in 1/10

* -----------------------------------------------------
* Part 1: Renaming Variables
* -----------------------------------------------------

* Rename mpg to Miles_Per_Gallon
rename mpg Miles_Per_Gallon

* Inspect the first few rows to check the new variable name
list make Miles_Per_Gallon in 1/10
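
* Optional sketch: several variables can be renamed in one command, and a
* variable keeps its label after a rename. The new names below are made up for
* illustration, so preserve/restore undoes the change.
preserve
rename (headroom trunk) (head_room trunk_space)
describe head_room trunk_space Miles_Per_Gallon
restore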

* -----------------------------------------------------
* Part 2: Filtering Data (using `keep` and `drop`)
* -----------------------------------------------------

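* Keep cars whose 1978 repair record (rep78) is above 3 (equivalent to filter() in R)
* Missing values count as larger than any number in Stata, so exclude them explicitly
keep if rep78 > 3 & !missing(rep78)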

* List the first few rows after filtering
list make rep78 in 1/10
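
* Optional sketch: missing numeric values compare as larger than any number in
* Stata, so a condition like rep78 > 3 is also true when rep78 is missing.
* This check reports whether such observations are still in memory.
count if missing(rep78)
display "Observations with missing rep78: " r(N)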

* Drop variables that will not be used

drop foreign gear_ratio
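
* Optional sketch: -keep- with a variable list is the mirror image of -drop-.
* It retains only the listed variables. preserve/restore undoes it here so the
* rest of the script still has every variable it needs.
preserve
keep make price weight
describe, short
restore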


* -----------------------------------------------------
* Part 3: Creating and Modifying Variables (equivalent to mutate() in R)
* -----------------------------------------------------

* Create a new variable: price per weight (price/weight)
gen price_per_weight = price / weight

* List the first few rows to check the new variable
list make price weight price_per_weight in 1/10
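
* Optional sketch: document and format the new variable. The label text is an
* assumption based on the units in the auto data (price in dollars, weight in lbs.).
label variable price_per_weight "Price per pound (USD/lb.)"
format price_per_weight %9.2f
list make price_per_weight in 1/5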

* -----------------------------------------------------
* Creating Multiple Variables
* -----------------------------------------------------

* Create two new variables:
* 1. price_per_weight (price/weight): it already exists from the previous step, and
*    gen fails on an existing variable, so drop it first
* 2. mpg_class: classify as "Efficient" if Miles_Per_Gallon > 20, otherwise "Non-efficient"
drop price_per_weight
gen price_per_weight = price / weight
gen mpg_class = cond(Miles_Per_Gallon > 20, "Efficient", "Non-efficient")

* List the first few rows to check both variables
list make price_per_weight mpg_class in 1/10
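
* Optional sketch: tabulate the new string variable, and build a numeric 0/1
* indicator as an alternative. The name efficient is hypothetical. A logical
* expression evaluates to 1 (true) or 0 (false); the if qualifier keeps a
* missing mileage from being coded as 1.
tabulate mpg_class
gen byte efficient = Miles_Per_Gallon > 20 if !missing(Miles_Per_Gallon)
tabulate mpg_class efficient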

* -----------------------------------------------------
* Conditional Mutate with `gen` and `cond()` (Equivalent to `case_when()` in R)
* -----------------------------------------------------

* Classify cars by weight (in lbs.) using nested cond():
* Light (< 2500), Medium (2500 to < 3500), Heavy (>= 3500)
gen weight_class = cond(weight < 2500, "Light", cond(weight >= 2500 & weight < 3500, "Medium", "Heavy"))

* List the first few rows to check the new classifications
list make weight weight_class in 1/10
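
* Optional sketch: the same classification written with -gen- and -replace-,
* which some find easier to read than nested cond(). weight_class2 is a
* hypothetical name; the cross-tabulation should show only matching pairs.
gen str6 weight_class2 = "Light" if weight < 2500
replace weight_class2 = "Medium" if weight >= 2500 & weight < 3500
replace weight_class2 = "Heavy" if weight >= 3500 & !missing(weight)
tabulate weight_class weight_class2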

* -----------------------------------------------------
* Modifying Existing Variables
* -----------------------------------------------------

* Leave price itself unchanged and derive a new categorical variable from it:
* "Low" (price <= 5000), "Medium" (5000 < price <= 10000), "High" (price > 10000)
gen price_class = cond(price > 10000, "High", cond(price > 5000 & price <= 10000, "Medium", "Low"))

* List the first few rows to check the new variable
list make price price_class in 1/10
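
* Optional sketch: store the categories as numbers with a value label, so they
* sort in a meaningful order (Low < Medium < High) instead of alphabetically.
* price_cat and price_lbl are hypothetical names.
gen byte price_cat = 1 if price <= 5000
replace price_cat = 2 if price > 5000 & price <= 10000
replace price_cat = 3 if price > 10000 & !missing(price)
label define price_lbl 1 "Low" 2 "Medium" 3 "High"
label values price_cat price_lbl
tabulate price_cat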

* -----------------------------------------------------
* Working with Multiple Variables
* -----------------------------------------------------

* Use the `egen` command with the rowmean() function to calculate the row-wise mean
* of Miles_Per_Gallon and weight.

* A row-wise mean is the average of several variables within each observation (row).
* A column-wise mean, by contrast, is a single average of one variable across all
* observations.

egen mean_mpg_weight = rowmean(Miles_Per_Gallon weight)

* List the first few rows to check the new variable
list make Miles_Per_Gallon weight mean_mpg_weight in 1/10
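
* Optional sketch: rowmean() averages the non-missing values in each row, whereas
* a hand-written formula returns missing if any input is missing. The two agree
* here because neither variable has missing values. manual_mean is a hypothetical name.
gen manual_mean = (Miles_Per_Gallon + weight) / 2
list make mean_mpg_weight manual_mean in 1/5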


* The variable mean_mpg_weight is not meaningful here. For a more realistic case, imagine
* that you want each person's average working hours over 3 days.
* Each row is one person, and the variables Day1, Day2, and Day3 record how many hours
* that person worked on each day.
* The code would then be as follows (commented out because the auto data has no such variables):

* egen mean_work_hours = rowmean(Day1 Day2 Day3)
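
* Optional sketch: a runnable version of the working-hours example on a small
* made-up dataset (the names and hours are invented for illustration).
* rowmean() skips the missing Day2 value for the second person, so the means
* are 8 for the first person and 7 for the second.
preserve
clear
input str5 person Day1 Day2 Day3
"Ann" 8 7 9
"Ben" 6 . 8
end
egen mean_work_hours = rowmean(Day1 Day2 Day3)
list
restore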


* -----------------------------------------------------
* Conclusion: Key Stata Functions Demonstrated
* - rename: to rename variables
* - keep: to filter rows based on conditions
* - gen: to create new variables
* - egen: to perform calculations across variables
* - cond(): to apply conditional logic
* - drop: to remove variables
* -----------------------------------------------------

Output Preview: `auto_data.csv`

df <- read.csv("auto_data.csv")
knitr::kable(head(df, 20))

make	price	mpg	rep78	headroom	trunk	weight	length	turn	displacement	gear_ratio	foreign
Cad. Seville	15906	21	3	3.0	13	4290	204	45	350	2.24	Domestic
Cad. Eldorado	14500	14	2	3.5	16	3900	204	43	350	2.19	Domestic
Linc. Mark V	13594	12	3	2.5	18	4720	230	48	400	2.47	Domestic
Linc. Versailles	13466	14	3	3.5	15	3830	201	41	302	2.47	Domestic
Peugeot 604	12990	14	NA	3.5	14	3420	192	38	163	3.58	Foreign
Volvo 260	11995	17	5	2.5	14	3170	193	37	163	2.98	Foreign
Linc. Continental	11497	12	3	3.5	22	4840	233	51	400	2.47	Domestic
Cad. Deville	11385	14	3	4.0	20	4330	221	44	425	2.28	Domestic
Buick Riviera	10372	16	3	3.5	17	3880	207	43	231	2.93	Domestic
Olds Toronado	10371	16	3	3.5	17	4030	206	43	350	2.41	Domestic
BMW 320i	9735	25	4	2.5	12	2650	177	34	121	3.64	Foreign
Audi 5000	9690	17	5	3.0	15	2830	189	37	131	3.20	Foreign
Olds 98	8814	21	4	4.0	20	4060	220	43	350	2.41	Domestic
Datsun 810	8129	21	4	2.5	8	2750	184	38	146	3.55	Foreign
Buick Electra	7827	15	4	4.0	20	4080	222	43	350	2.41	Domestic
VW Dasher	7140	23	4	2.5	12	2160	172	36	97	3.74	Foreign
VW Scirocco	6850	25	4	2.0	16	1990	156	36	97	3.78	Foreign
Plym. Sapporo	6486	26	NA	1.5	8	2520	182	38	119	3.54	Domestic
Dodge St. Regis	6342	17	2	4.5	21	3740	220	46	225	2.94	Domestic
Merc. XR-7	6303	14	4	3.0	16	4130	217	45	302	2.75	Domestic
