Geocoding Application with Vector Databases and Sentence Transformers: LanceDB, Pgvector
Content
- Introduction
- Data Preparation and Vector Embedding
- Create Vector Database
- Summary
Introduction
On February 6, 2023, an earthquake struck Turkey and affected 10 cities. Thousands of people were injured or lost their lives.
The country mobilized for emergency response immediately after the earthquake, but many problems arose. One of them was the lack of sufficient geographic data: building data, for example, was not accessible. People on social media shared the addresses of their loved ones and asked for help, yet paid services had to be used to turn those addresses into locations, and most of those services had outdated data or no data at all.
The implementation of Geographic Information Systems (GIS) plays a crucial role in enhancing the efficiency of emergency response efforts following an earthquake. In this article, I aim to develop an application that leverages vector databases to query address and location data. While acknowledging that the application may have limitations or areas of incompleteness, my primary objective is to conceptualize an innovative solution that can contribute to disaster response strategies.
Data Preparation and Vector Embedding
We will build the application using building data from the cities affected by the earthquake.
To install the libraries for vector embeddings, we need to run the following command:
pip install pandas geopandas pyarrow transformers torch
If you get an error while installing the PyTorch library, you can follow the installation instructions on the official PyTorch website.
Let’s import the libraries we installed:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import geopandas as gpd
Reading data:
df = pd.read_csv("bina.csv")
df.columns
"""
['il', 'ilce', 'mahalle', 'sokak', 'binaNo', 'x', 'y']
"""
Coordinate system transformation. We reproject the points from WGS84 (EPSG:4326, degrees) to Web Mercator (EPSG:3857, metre-based), so that Euclidean (L2) distances between coordinate vectors are meaningful:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["x"], df["y"]), crs=4326)
gdf = gdf.to_crs("EPSG:3857")
df["x"] = gdf.geometry.x
df["y"] = gdf.geometry.y
Combine address information and create vectors from coordinates:
df["adres"] = df[['il', 'ilce', 'mahalle', 'sokak', 'binaNo']].apply(lambda x: f"{x['mahalle']} mahallesi {x['sokak']} no {x['binaNo']} {x['ilce']} {x['il']} ",axis=1)
df["vector_geom"] = df[["x","y"]].apply(lambda x: [x["x"],x["y"]],axis=1)
df.to_csv("bina.csv",index=False)
df.sample(10)
I tried many sentence-transformers models, and the Hugging Face model below gave the best results. We use it to vectorize the address information:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
# Use the GPU if available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, weighted by the attention mask
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = AutoTokenizer.from_pretrained(
'emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
model = AutoModel.from_pretrained(
'emrecan/bert-base-turkish-cased-mean-nli-stsb-tr').to(device)
def encode(adres):
    encoded_input = tokenizer(adres, padding=True, truncation=True,
                              return_tensors='pt').to(device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return sentence_embeddings.tolist()[0]
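As a quick sanity check (the address is the example used later in this article), a single call should return a 768-dimensional list:
vec = encode("General Şükrü Kanatlı mahallesi Atatürk no 2 Antakya Hatay")
print(len(vec))  # 768 for this model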
data = pd.read_csv("bina.csv")["adres"].values
vector = []
print("embedding")
for i in data:
    embed = encode(i)
    vector.append(embed)
del data
print("embedded")
data_vector = pd.DataFrame()
data_vector["vector"] = vector
data_vector.to_parquet("vector-adres.parquet")
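Encoding the addresses one at a time is slow for a large table. Since the tokenizer and the model both accept lists of strings, the same computation can be batched; encode_batch is my own helper, and the batch size of 64 is an assumption to tune to your GPU memory:
def encode_batch(adresler, batch_size=64):
    # Embed addresses in batches instead of one at a time
    vectors = []
    for i in range(0, len(adresler), batch_size):
        batch = list(adresler[i:i + batch_size])
        encoded_input = tokenizer(batch, padding=True, truncation=True,
                                  return_tensors='pt').to(device)
        with torch.no_grad():
            model_output = model(**encoded_input)
        embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
        vectors.extend(embeddings.tolist())
    return vectors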
Create Vector Database
In this section, we will create an application with LanceDB and Pgvector.
LanceDB
LanceDB is an open-source vector database for AI that’s designed to store, manage, query and retrieve embeddings on large-scale multi-modal data. The core of LanceDB is written in Rust and is built on top of Lance, an open-source columnar data format designed for performant ML workloads and fast random access.
To install the libraries:
pip install lancedb
Import libraries:
import lancedb
import pyarrow as pa
import pandas as pd
import ast

df = pd.read_csv("bina.csv")
# vector_geom was serialized as text in the CSV; parse it back into a list
df["vector_geom"] = df["vector_geom"].apply(ast.literal_eval)
df_vec = pd.read_parquet("vector-adres.parquet")
df["vector_adres"] = df_vec["vector"].values
del df_vec
We create a schema for our LanceDB table and then add our dataframe. The vector dimensions must be fixed in the schema: the address embeddings produced by the model have 768 dimensions, and the location vector has 2 (x and y).
schema = pa.schema(
[
pa.field("il", pa.string()),
pa.field("ilce", pa.string()),
pa.field("mahalle", pa.string()),
pa.field("sokak", pa.string()),
pa.field("binaNo", pa.string()),
pa.field("x", pa.float32()),
pa.field("y", pa.float32()),
pa.field("adres", pa.string()),
pa.field("vector_geom", pa.list_(pa.float32(), 2)),
pa.field("vector_adres", pa.list_(pa.float32(), 768)),
])
db = lancedb.connect("db")
db.create_table("bina", df.dropna(), exist_ok=True, on_bad_vectors="drop", schema=schema)
Open the table:
table = db.open_table("bina")
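By default LanceDB answers queries with a brute-force scan, which is fine at this scale. For much larger tables you can also build an approximate IVF_PQ index on the address vectors; the parameters below are assumptions to tune, and the exact signature may vary between LanceDB versions:
table.create_index(metric="L2", vector_column_name="vector_adres",
                   num_partitions=256, num_sub_vectors=96)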
We will try to find the closest points to a given location. LanceDB’s L2 metric returns the squared Euclidean distance, so we take the square root to turn _distance back into map units:
import math

res = table.search([4025368.560015853, 4328570.242365282],
                   vector_column_name="vector_geom").metric("L2").limit(5).to_pandas()
res["_distance"] = res["_distance"].apply(lambda x: math.sqrt(x))  # squared L2 -> L2
res
Now let’s find the closest addresses to a written address.
adres = "General Şükrü Kanatlı mahallesi Atatürk no 2 Antakya Hatay"
vec = encode(adres=adres)
res = table.search(vec, vector_column_name="vector_adres").metric(
"L2").limit(5).to_pandas()
res
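Putting the two steps together, a minimal geocoding helper (geocode is my own wrapper name, not a LanceDB API) looks like this:
def geocode(adres, k=5):
    # Embed the free-text address, then return the k most similar rows
    vec = encode(adres=adres)
    return (table.search(vec, vector_column_name="vector_adres")
                 .metric("L2").limit(k).to_pandas())

geocode("General Şükrü Kanatlı mahallesi Atatürk no 2 Antakya Hatay")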
Pgvector
Pgvector is an open-source vector similarity search extension for Postgres.
To install pgvector with docker:
docker pull pgvector/pgvector:pg16
docker run -d --name pgvector -p 5434:5432 -e POSTGRES_PASSWORD=admin -v pgvector_data:/var/lib/postgresql/data pgvector/pgvector:pg16
Let’s connect to the database, enable the pgvector extension, and create our table.
CREATE EXTENSION vector;
CREATE TABLE public.bina (
il text NULL,
ilce text NULL,
mahalle text NULL,
sokak text NULL,
binano text NULL,
x float NULL,
y float NULL,
adres text NULL,
vector_geom vector(2) NULL,
vector_adres vector(768) NULL
);
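If you prefer to run this SQL from Python rather than a database client, a minimal sketch with psycopg2 (installed in the next step) could be:
import psycopg2

# Connection details match the docker run command above
conn = psycopg2.connect(host="localhost", port=5434, dbname="postgres",
                        user="postgres", password="admin")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()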
Let’s add our data to the table we created in the database.
To install the libraries:
pip install sqlalchemy asyncpg psycopg2-binary
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy import String, Table, Column, MetaData, FLOAT
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy import URL
import pandas as pd
url_object = URL.create(
"postgresql+asyncpg",
host="localhost",
port="5434",
database="postgres",
username="postgres",
password="admin",
)
engine = create_async_engine(
url_object,
echo=True,
)
meta = MetaData()
t1 = Table(
"bina",
meta,
Column("il", String),
Column("ilce", String),
Column("mahalle", String),
Column("sokak", String),
Column("binano", String),
Column("x", FLOAT),
Column("y", FLOAT),
Column("adres", String),
Column("vector_geom", ARRAY(FLOAT)),
Column("vector_adres", ARRAY(FLOAT)),
)
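Declaring the vector columns as ARRAY(FLOAT) relies on pgvector casting float arrays to vector on assignment. As an alternative sketch, the pgvector Python package (pip install pgvector, an extra dependency not used above) provides a dedicated SQLAlchemy type that makes the column types explicit; t1_alt is just an illustrative variable name:
from pgvector.sqlalchemy import Vector

t1_alt = Table(
    "bina", MetaData(),
    Column("il", String), Column("ilce", String), Column("mahalle", String),
    Column("sokak", String), Column("binano", String),
    Column("x", FLOAT), Column("y", FLOAT), Column("adres", String),
    Column("vector_geom", Vector(2)),
    Column("vector_adres", Vector(768)),
)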
import asyncio

data = df.values

async def insert_rows():
    # Insert the rows in batches of 1000
    count = 0
    while True:
        if count >= len(data):
            break
        values = data[count:count + 1000]
        rows = []
        for i in values:
            row = {
                "il": i[0],
                "ilce": i[1],
                "mahalle": i[2],
                "sokak": i[3],
                "binano": i[4],
                "x": i[5],
                "y": i[6],
                "adres": i[7],
                "vector_geom": i[8],
                "vector_adres": i[9].tolist(),
            }
            rows.append(row)
        async with engine.begin() as conn:
            await conn.execute(t1.insert(), rows)
        count += 1000

asyncio.run(insert_rows())  # in a notebook, use: await insert_rows()
Now let’s create an index to speed up our queries. Pgvector offers two index types: HNSW and IVFFlat.
Since my computer’s RAM is not enough, I will choose IVFFlat :) If your RAM is sufficient, HNSW is usually the better option. Pgvector recommends building an IVFFlat index after the table already contains data, and starting with lists ≈ rows / 1000 for tables of up to a million rows.
To create the index for both vector columns:
CREATE INDEX idx_ivfflat_l2_geom_bina ON public.bina USING ivfflat (vector_geom vector_l2_ops) WITH (lists = 100);
CREATE INDEX idx_ivfflat_l2_adres_bina ON public.bina USING ivfflat (vector_adres vector_l2_ops) WITH (lists = 100);
Let’s run the same nearest-location query with pgvector, using L2 (Euclidean) distance:
from sqlalchemy import create_engine
from sqlalchemy import URL
import pandas as pd
url_object = URL.create(
"postgresql+psycopg2",
host="localhost",
port="5434",
database="postgres",
username="postgres",
password="admin",
)
engine = create_engine(
url_object,
echo=True,
)
query = """
SELECT il, ilce, mahalle, sokak, binano, adres, l2_distance(bina.vector_geom,ARRAY[4025368.560015853, 4328570.242365282]::vector) distance
FROM bina
ORDER BY bina.vector_geom <-> ARRAY[4025368.560015853, 4328570.242365282]::vector
LIMIT 5;
"""
df_query = pd.read_sql(query,con=engine)
df_query
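IVFFlat is an approximate index: at query time it scans only ivfflat.probes lists (1 by default), so you can trade speed for recall by probing more lists. A sketch of raising it for a session (the value 10 is an assumption to tune):
from sqlalchemy import text

with engine.connect() as conn:
    conn.execute(text("SET ivfflat.probes = 10"))  # session-level setting
    df_query = pd.read_sql(query, con=conn)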
Now let’s find the closest addresses to a written address.
vec = encode(adres="General Şükrü Kanatlı mahallesi Atatürk no 2 Antakya Hatay")
query = f"""
SELECT il, ilce, mahalle, sokak, binano, adres, l2_distance(bina.vector_adres,ARRAY{vec}::vector) distance
FROM bina
ORDER BY bina.vector_adres <-> ARRAY{vec}::vector
LIMIT 5;
"""
df_query = pd.read_sql(query,con=engine)
df_query
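Interpolating the embedding into the SQL string with an f-string works, but a bound parameter avoids rebuilding a very long statement for every query. A sketch using SQLAlchemy's text(), relying on the fact that pgvector accepts a vector written as the string '[v1,v2,...]':
from sqlalchemy import text

vec_str = "[" + ",".join(map(str, vec)) + "]"  # pgvector's text input format
query_param = text("""
SELECT il, ilce, mahalle, sokak, binano, adres,
       l2_distance(vector_adres, (:q)::vector) AS distance
FROM bina
ORDER BY vector_adres <-> (:q)::vector
LIMIT 5;
""")
df_query = pd.read_sql(query_param, con=engine, params={"q": vec_str})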
Or we can query only by neighborhood name.
vec = encode(adres="General Şükrü Kanatlı mah Antakya Hatay")
query = f"""
SELECT il, ilce, mahalle, sokak, binano, adres, l2_distance(bina.vector_adres, ARRAY{vec}::vector) distance
FROM bina
ORDER BY bina.vector_adres <-> ARRAY{vec}::vector
LIMIT 15;
"""
df_query = pd.read_sql(query,con=engine)
df_query
Summary
We tried to create a geocoding application with Pgvector and LanceDB. We retrieved the stored addresses most similar to a written address, and we found the points closest to a given location using Euclidean distance in a vector database.
These tools have proven to be valuable assets for developers, offering high performance and rapid query responses, often within milliseconds. Both Pgvector and LanceDB also integrate seamlessly with a wide array of prominent AI tools, so you can select the technology that best fits the requirements of your project.
You can choose LanceDB to quickly build serverless and cost-effective applications. If, like me, you don’t want to leave the PostgreSQL ecosystem, Pgvector is a good choice :)
Thank you for reading. I would be grateful for your feedback. You can reach me via LinkedIn.
See you in the next article.
Resources
Pgvector: https://github.com/pgvector/pgvector
LanceDB: https://lancedb.github.io/lancedb/
Model: https://huggingface.co/emrecan/bert-base-turkish-cased-mean-nli-stsb-tr
Sentence Transformers: https://www.sbert.net/