osmar2 is a version of the osmar package (“OpenStreetMap and R”) rewritten to use the xml2 package instead of XML, along with some other adaptations that allow for much quicker reading of .osm files. It’s available on GitHub.
The part I wanted to spend some time writing about here was the
optimization and debugging process that went into making it faster.
Going into this project, I was pretty sure there were inefficiencies in
how the package read files, because even though the files were
relatively small (10 MB), osmar was already taking several minutes to
load them, whereas read.csv() (or better, readr::read_csv()) could read
a file that size in less than a second.
So I first thought that just switching over to xml2 would be most of what I needed, but when I had finished rewriting the code using the original osmar package’s approach, updated with xml2 functions instead of XML ones, there wasn’t much of an improvement. What ended up speeding things up was a change in the parsing algorithm.
Here’s a sample osm xml file, which I’ve simplified so that it only has a few nodes, and I’ve removed some tags that aren’t used (more details on osm files here).
#> ...
#> <node id="506685546" lat="44.9493365" lon="-93.1692413" version="1"/>
#> <node id="506685547" lat="44.9496459" lon="-93.1691436" version="1">
#> <tag k="highway" v="turning_circle"/>
#> </node>
#> <node id="530739183" lat="44.9399208" lon="-93.1875805" version="1">
#> <tag k="name" v="Davanni's Pizza & Hot Hoagies"/>
#> <tag k="amenity" v="restaurant"/>
#> <tag k="cuisine" v="italian"/>
#> </node>
#> ...
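The examples below all work from the root of a parsed document, which I call osm_xml_root. As a minimal sketch of how that root gets created (the filename would normally point at a downloaded .osm file; here I just parse a tiny inline document shaped like the snippet above so it’s self-contained):

```r
library(xml2)

# In the real workflow this would be read_xml("some_map.osm") on a
# downloaded .osm file; parsing an inline string keeps the sketch
# self-contained.
osm_xml_root <- read_xml('
<osm>
  <node id="506685546" lat="44.9493365" lon="-93.1692413" version="1"/>
  <node id="506685547" lat="44.9496459" lon="-93.1691436" version="1">
    <tag k="highway" v="turning_circle"/>
  </node>
</osm>')

xml_attr(xml_find_all(osm_xml_root, "/osm/node"), "id")
#> [1] "506685546" "506685547"
```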
One of the slowest parts of parsing the file in the original osmar
package is creating a dataset of the tag variables. This dataset
needs one row per tag node, but also includes a variable from that
tag’s parent node (the id variable). The original way
I attacked this looked something like this:
old_method <- function() {
  # Find all of the nodes with at least one tag
  nodes <- xml_find_all(osm_xml_root, "/osm/node[./tag]")
  # Pull those nodes' ids
  ids <- xml_attr(nodes, "id")
  # Pull those nodes' (1 or more) tags into a list of data.frames
  tags <- lapply(nodes, function(parent_node) {
    tag_nodes <- xml_find_all(parent_node, "./tag")
    data_frame(k = xml_attr(tag_nodes, "k"),
               v = xml_attr(tag_nodes, "v"))
  })
  # And convert that to a single long data.frame
  out_df <- data_frame(ids, tags = tags) %>%
    unnest()
  out_df
}
old <- old_method()
old %>% print(n = 3)
#> # A tibble: 4,616 x 3
#> ids k v
#> <chr> <chr> <chr>
#> 1 187845021 highway mini_roundabout
#> 2 187854927 highway mini_roundabout
#> 3 187854980 noexit yes
#> # ... with 4,613 more rows
As I began to dig into this, I learned that some of the xml2 functions let you perform the entire operation in C and so are very fast, while others pull the results into R objects and force much slower looping in R. So it doesn’t always make sense to try to save your place in the xml document to avoid repeatedly searching for the same spots, because doing so may force you to loop in R rather than C. For example:
system.time(
  osm_xml_root %>%
    xml_find_all("/osm/node") %>%
    xml_find_all("tag/@k") %>%
    xml_text()
)
#> user system elapsed
#> 3.005 0.029 3.035
# Is 30X slower than this:
system.time(
  osm_xml_root %>%
    xml_find_all("/osm/node/tag/@k") %>%
    xml_text()
)
#> user system elapsed
#> 0.120 0.003 0.123
# So even though the second method requires us to search from the top
# of the tree for the k and v variable separately, it is still faster.
The only problem is that we need a way to find out which node’s id each key/value pair belongs to. This requires some tricky(-ish) xpath.
# Find all nodes with at least one tag and get their id
nodes <- xml_find_all(osm_xml_root, "/osm/node[./tag]")
id <- xml_attr(nodes, "id")
# Find out how many tags are below each of these nodes and
# repeat the id that many times
lens <- xml_find_num(nodes, "count(./tag)")
id <- rep(id, lens)
Putting it all together, we get:
new_method <- function() {
  # Find all of the nodes with at least one tag and pull their ids
  nodes <- xml_find_all(osm_xml_root, "/osm/node[./tag]")
  ids <- xml_attr(nodes, "id")
  # Find out how many tags are below each of these nodes and
  # repeat the id that many times
  lens <- xml_find_num(nodes, "count(./tag)")
  ids <- rep(ids, lens)
  # Pull the tags' keys and values
  keys <- osm_xml_root %>%
    xml_find_all("/osm/node/tag/@k") %>%
    xml_text()
  values <- osm_xml_root %>%
    xml_find_all("/osm/node/tag/@v") %>%
    xml_text()
  data_frame(ids = ids,
             k = keys,
             v = values)
}
new <- new_method()
new %>% print(n = 3)
#> # A tibble: 4,616 x 3
#> ids k v
#> <chr> <chr> <chr>
#> 1 187845021 highway mini_roundabout
#> 2 187854927 highway mini_roundabout
#> 3 187854980 noexit yes
#> # ... with 4,613 more rows
Which gives the same result, but much more quickly than the old method:
microbenchmark(old = old <- old_method(),
               new = new <- new_method(),
               times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> old 3648.1691 3697.593 3906.4411 3949.5129 4015.202 4221.7289 5
#> new 354.0688 374.710 380.0945 378.4144 380.622 412.6573 5
identical(old, new)
#> [1] TRUE
This makes me wonder whether it’d be useful for xml2 to have a function
along the lines of xml_find_list() so that you could more easily
run multiple queries using the quicker C code. I’m not exactly sure
how this would work, though. Also, possibly I’m missing something and
there’s an easier way to do this. If there is, let me
know.
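For what it’s worth, a rough approximation of that idea is possible today with a small helper. The name xml_find_multi is made up; each individual query still runs in C via xml_find_all(), and only the outer lapply() loops in R:

```r
library(xml2)

# Hypothetical helper: run several XPath queries against one root and
# return the text results as a named list. Each query is evaluated in
# C by xml_find_all(); only the outer lapply() loops in R.
xml_find_multi <- function(x, queries) {
  lapply(queries, function(q) xml_text(xml_find_all(x, q)))
}

# e.g. both tag attributes in one call:
# xml_find_multi(osm_xml_root,
#                list(k = "/osm/node/tag/@k",
#                     v = "/osm/node/tag/@v"))
```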