Databases for text storage
Backends
Issues
- Indexing
- Query language
- Schema
We'll show how to:
- Define an (indexed) collection (table)
- Store some tweets in it
- Run some queries to fetch those tweets
PostgresSQL
- SQL (SELECT * FROM tweets)
- Postgres (opensource & free, very fast, text indexing)
- Quick note on data types
DB setup
$ createdb texts
$ psql texts
CREATE TABLE users (
id bigint primary key,
screen_name text,
description text);
CREATE TABLE tweets (
id bigint primary key,
user_id bigint REFERENCES users(id),
retweet_of bigint,
timestamp timestamp,
text text);
- run the statements in
tweets.sql
to insert data.
CREATE INDEX text_idx ON tweets USING gin(to_tsvector('english', text));
Destroy database
$ dropdb texts
SQL commands to show:
SELECT * FROM tweets WHERE user_id=258022817;
SELECT * FROM tweets WHERE lower(text) LIKE '%obama%';
SELECT * FROM tweets WHERE timestamp > '2014-12-2';
SELECT tweets.timestamp, users.screen_name FROM tweets JOIN users ON tweets.user_id=users.id;
SELECT to_tsvector('english', 'hi there my name is jona i am a programmer and this is a database intro') @@ to_tsquery('programming');
SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');
CREATE INDEX text_idx ON tweets USING gin(to_tsvector('english', text));
% Now this will be faster...
SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');
SELECT user_id, count(*) FROM tweets GROUP BY user_id;
SELECT users.screen_name, count(*) FROM tweets JOIN users ON users.id=tweets.user_id GROUP BY users.screen_name, user_id;
rpostgresql
install.packages("RPostgreSQL")
library("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="texts")
rs <- dbSendQuery(con, "select * from tweets")
tweets <- fetch(rs, n=10)
MongoDB
db.tweets.find()
- opensource, free, and very fast
- no schema!
mongoexport -h smapp-politics:27011 -d OccupyWallStreet -c tweets -u smapp_readOnly -p 'xxxxxxxx' | mongoimport -d OccupyWallStreet -c tweets
db.tweets.ensureIndex({text: "text" })
db.tweets.findOne({$text: {$search: 'obama'}})
db.tweets.aggregate({$group: {_id: "$user.screen_name", number: {$sum: 1}}})
RMongo
install.packages('RMongo')
library(RMongo)
conn <- mongoDbConnect('OccupyWallStreet')
query <- dbGetQuery(conn, 'tweets', '{$text: {$search: "obama"}}')
data1 <- query[c('timestamp', 'text')]
Links