Parsing Apple Health Data
Apple provides easy access to personal health data. The challenge is parsing the blob of data and extracting useful features.
Below is a protocol for streaming heart rate telemetry from a large XML document into a simple CSV file on a personal Linux laptop.
Tools:
Retrieve health data
- Go to the Health app on the iPhone
- Tap the profile icon and scroll to the bottom
- Select the “export all health data” option
- Push export.zip to cloud storage
- Retrieve and extract the archive on the local machine
The export process may take several minutes, and the archive can be tens of MB.
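The last step can also be scripted. The following is a minimal sketch using Python's standard zipfile module. The function name extract_export is hypothetical, and the path it returns assumes the archive unpacks into an apple_health_export directory containing export.xml.

```python
import zipfile
from pathlib import Path


def extract_export(archive_path, dest_dir):
    """Extract the Health export archive and return the expected path to export.xml."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(archive_path) as archive:
        # Unpack every member of the archive under dest_dir
        archive.extractall(dest)
    # Assumption: the archive unpacks to apple_health_export/export.xml
    return dest / "apple_health_export" / "export.xml"
```

For example, extract_export("export.zip", "export") would unpack the archive into an export directory next to the script.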
The extracted health data has the following structure:
export
└── apple_health_export
    ├── export_cda.xml
    ├── export.xml
    └── workout-routes
        ├── route_2019-03-11_1.19pm.gpx
        ├── route_2019-03-12_1.29pm.gpx
        ├── route_2019-03-13_6.58pm.gpx
        ├── ......
All health telemetry is rolled up in the export.xml file. This file can be hundreds of MB.
Within export.xml the HKQuantityTypeIdentifierHeartRate record type holds heart rate data:
<Record type="HKQuantityTypeIdentifierHeartRate"
...
unit="count/min"
creationDate="2019-03-02 20:42:24 -0500"
startDate="2019-03-02 19:47:18 -0500"
endDate="2019-03-02 19:47:18 -0500"
value="79">
...
</Record>
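To make the record layout concrete, here is a small sketch that parses one such record and reads its attributes. The sample string contains only the attributes visible in the record above; the parts elided with "..." are left out.

```python
import xml.etree.ElementTree as ET

# Sample record built only from the attributes shown in the post
sample = (
    '<Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min" '
    'creationDate="2019-03-02 20:42:24 -0500" '
    'startDate="2019-03-02 19:47:18 -0500" '
    'endDate="2019-03-02 19:47:18 -0500" value="79"></Record>'
)

record = ET.fromstring(sample)
print(record.tag)                                      # Record
print(record.attrib["value"], record.attrib["unit"])   # 79 count/min
```

Every attribute comes back as a string, so the value has to be converted before doing arithmetic on it.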
Parse heart rate data with python
Each record in export.xml can be parsed with the iterparse function from Python's xml.etree.ElementTree module and filtered on the HKQuantityTypeIdentifierHeartRate type.
Filtered heart rate records are then converted from XML to JSON and printed to the terminal, one object per line, for the next processing step.
Depending on the size of export.xml, the Python script may run for several minutes:
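# parse.py
import json
import sys
from xml.etree.ElementTree import iterparse

# Stream the file element by element instead of loading it all at once
for _, elem in iterparse(sys.argv[1]):
    if elem.tag == "Record":
        if elem.get("type") == "HKQuantityTypeIdentifierHeartRate":
            # Emit the record's attributes as one JSON object per line
            print(json.dumps(elem.attrib))
    # Drop the element's attributes and children once handled to limit memory use
    elem.clear()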
Stream heart rate json into csv with jq
Now that JSON-like heart rate records are being printed to the terminal, they can be picked up in a Unix-style pipeline.
The pipeline streams heart rate data into jq to extract telemetry and then writes the formatted data to a clean CSV file.
The entire pipeline can be run in a bash terminal:
python3 parse.py data/export.xml | \
jq -r '[.endDate, .type, .unit, .value] | @csv' \
> data/heart_rate.csv
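For readers without jq, the field selection done by the filter above can be approximated in Python. This is only a sketch: the function name json_lines_to_csv is hypothetical, and the output is not byte-identical to jq's @csv, which wraps every string field in double quotes while csv.writer quotes only when needed.

```python
import csv
import io
import json

# Same fields, in the same order, as the jq filter
FIELDS = ["endDate", "type", "unit", "value"]


def json_lines_to_csv(lines, out):
    """Write one CSV row per JSON object found in lines."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing keys become empty fields
        writer.writerow([record.get(field) for field in FIELDS])


# One line shaped like the output of parse.py
sample = [
    '{"type": "HKQuantityTypeIdentifierHeartRate", "unit": "count/min", '
    '"endDate": "2019-03-02 19:47:18 -0500", "value": "79"}'
]
buffer = io.StringIO()
json_lines_to_csv(sample, buffer)
print(buffer.getvalue(), end="")
# 2019-03-02 19:47:18 -0500,HKQuantityTypeIdentifierHeartRate,count/min,79
```

Passing sys.stdin and an open file object instead of the sample list and buffer would let the function sit at the end of the same pipeline.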
Now the fully parsed and neatly formatted heart rate data can be picked up by pandas and other tools for deeper analysis.
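As a first look at the result, the sketch below reads rows in the layout produced by the jq filter and averages the heart rate values. It uses only the standard library so that it runs without extra installs, and the function name load_heart_rates is hypothetical. Note that the CSV has no header row, so column names must be supplied by the reader; with pandas, the equivalent would be passing names=COLUMNS to read_csv. The second sample row is made up for illustration.

```python
import csv
import io
from datetime import datetime
from statistics import mean

# Column order matches the jq filter: [.endDate, .type, .unit, .value]
COLUMNS = ["endDate", "type", "unit", "value"]


def load_heart_rates(csv_file):
    """Return (timestamp, beats per minute) tuples from a header-less CSV."""
    readings = []
    for row in csv.DictReader(csv_file, fieldnames=COLUMNS):
        timestamp = datetime.strptime(row["endDate"], "%Y-%m-%d %H:%M:%S %z")
        readings.append((timestamp, float(row["value"])))
    return readings


# Two rows in jq's @csv format; the second row is illustrative only
sample = io.StringIO(
    '"2019-03-02 19:47:18 -0500","HKQuantityTypeIdentifierHeartRate","count/min","79"\n'
    '"2019-03-02 19:52:18 -0500","HKQuantityTypeIdentifierHeartRate","count/min","81"\n'
)
readings = load_heart_rates(sample)
print(mean(value for _, value in readings))  # 80.0
```

For the real file, open("data/heart_rate.csv", newline="") can be passed in place of the sample.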
If other records need to be extracted:
- Modify the parse.py script to filter on a given type
- Adjust the arguments in the jq command
- Define a different csv name
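One way to handle the first adjustment is to turn the record type into a parameter. The sketch below is a hypothetical variant of parse.py, not the original script. The function name iter_records, the HealthData root element, and the HKQuantityTypeIdentifierStepCount sample are assumptions used for illustration.

```python
import io
import json
from xml.etree.ElementTree import iterparse


def iter_records(source, record_type):
    """Yield the attributes of every Record element of the given type."""
    for _, elem in iterparse(source):
        if elem.tag == "Record" and elem.get("type") == record_type:
            # Copy the attributes before the element is cleared
            yield dict(elem.attrib)
        # Free the element's contents to limit memory use
        elem.clear()


# Tiny stand-in for export.xml (root element name is an assumption)
sample = io.BytesIO(
    b"<HealthData>"
    b'<Record type="HKQuantityTypeIdentifierHeartRate" value="79"/>'
    b'<Record type="HKQuantityTypeIdentifierStepCount" value="12"/>'
    b"</HealthData>"
)
for record in iter_records(sample, "HKQuantityTypeIdentifierStepCount"):
    print(json.dumps(record))
```

Since iterparse accepts either a file name or a file object, the same function can be pointed at data/export.xml, with the record type read from a second command line argument.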
While not the most performant, this method can reliably parse a large Apple Health export.xml on a personal Linux laptop.