Posts

Showing posts from August, 2020

Data Scraping

On Monday, 31 August, there was a guest lecture at Sabudh Foundation where we were introduced to the concept of Data Scraping.

What is Data Scraping?

Data scraping, also known as web scraping, is the process of importing information from a website into a CSV or other local file. It's one of the most efficient ways to get data from the web, and in some cases to channel that data to another website.

Popular uses of data scraping include:

Research for web content/business intelligence
Pricing for travel booking sites or price comparison sites
Finding sales leads or conducting market research by crawling public data sources (e.g. Yell and Twitter)
Sending product data from an e-commerce site to another online vendor (e.g. Google Shopping)

Scraping using Python

There are numerous packages available for web scraping in Python, but we only need a handful of them to scrape almost any site. Some of these libraries are named below:

Requests
BeautifulSoup
lxml
Selenium
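To make the idea of "importing information from a website into a CSV" concrete, here is a minimal sketch. It is only an illustration under stated assumptions: the HTML snippet, the column names, and the helper names (`CellCollector`, `scrape_rows`, `rows_to_csv`) are hypothetical, and the standard library's `html.parser` stands in for BeautifulSoup so the example runs without installing anything. In a real scraper the page would first be downloaded (for example with Requests) and then parsed with one of the libraries listed above.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content, hard-coded so the sketch needs no network access.
SAMPLE_HTML = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts
        elif tag == "td":
            self._in_cell = True      # text that follows belongs to a cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def scrape_rows(html):
    """Return the table cells of `html` as a list of rows."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Serialise the scraped rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])  # assumed column names
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The same two steps (extract structured rows, then write them out) apply whichever parsing library is used; only the extraction step changes.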

Text Analysis (Part-5)

The last lecture, on Tuesday, completed the contributions from the Bayesians; the next lectures will be based on the contributions from the neural scientists, i.e. word embeddings. This blog discusses one of the word embedding techniques, called Word2Vec.

What is Word Embedding?

Word embedding is one of the most popular representations of document vocabulary. Word embeddings are vector representations of words that are capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

Need of Word Embedding

The need for word embedding can be understood through the following example sentences.

Sentence 1: Have a good day.
Sentence 2: Have a great day.

The two sentences hardly differ in meaning. If we construct a vocabulary out of them, we get a set V:

V = {Have, a, good, great, day}

We now create a one-hot encoded vector for each of these words in V. Length of our one-hot encod
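The one-hot encoding of the vocabulary V can be sketched in a few lines. This is only an illustration: the ordering of the vocabulary and the helper names `one_hot` and `dot` are assumptions, not something taken from the lecture. It shows why one-hot vectors motivate embeddings: "good" and "great" are close in meaning, yet their vectors share no component.

```python
# Vocabulary from the two example sentences; the ordering is an assumption.
vocabulary = ["Have", "a", "good", "great", "day"]


def one_hot(word, vocab):
    """Return a vector with a 1 at the word's position and 0 elsewhere."""
    return [1 if entry == word else 0 for entry in vocab]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

# Similarity between the two near-synonyms, measured by the dot product.
good_great = dot(vectors["good"], vectors["great"])

for word, vector in vectors.items():
    print(word, vector)
print("good . great =", good_great)
```

Every pair of distinct words gets a dot product of zero, so one-hot vectors carry no notion of similarity at all.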

Text Analysis (Part-4)

This article covers a very important concept of Text Analysis from the Bayesians, i.e. Latent Dirichlet Allocation, which is a part of topic modelling.

What is Topic Modelling?

Topic modelling is a type of statistical modelling for discovering the abstract “topics” that occur in a collection of documents. This can be useful for search engines, customer service automation, and any other instance where knowing the topics of documents is important. Thus, topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. There are multiple methods for topic modelling, and one of them is Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text in a document to a particular topic. It builds a topic-per-document model and a words-per-topic model, modelled as Dirichlet distributions.

Basic Idea

LDA is a  genera
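The two Dirichlet-distributed models mentioned above (topics per document and words per topic) can be illustrated with a small simulation. This is a sketch under assumptions, not the lecture's code: the vocabulary, the number of topics, the concentration value 0.5, and the helper names are all made up for illustration, and it only generates a toy document from such a model rather than fitting LDA to real text.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(vocab, topic_word, doc_alpha, length, rng):
    """Generate `length` words: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(doc_alpha, rng)  # topic mixture of this document
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)                             # fixed seed for repeatability
vocab = ["goal", "match", "vote", "election"]      # hypothetical vocabulary
num_topics = 2                                     # assumed number of topics

# One word distribution per topic, each drawn from a Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, document = generate_document(vocab, topic_word, [0.5] * num_topics, 8, rng)
print("topic mixture:", theta)
print("document:", " ".join(document))
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the word distributions that most plausibly produced them.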

Text Analysis (Part-3)

This article covers a very important concept of Text Analysis from the Bayesians, i.e. Latent Dirichlet Allocation, which is a part of topic modelling.

What is Topic Modelling?

Topic modelling is a type of statistical modelling for discovering the abstract “topics” that occur in a collection of documents. This can be useful for search engines, customer service automation, and any other instance where knowing the topics of documents is important. Thus, topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. There are multiple methods for topic modelling, and one of them is Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text in a document to a particular topic. It builds a topic-per-document model and a words-per-topic model, modelled as Dirichlet distributions.

Basic Idea

LDA is a  genera

Text Analysis (Part-2)

The previous lecture/blog was an introduction to Text Analysis; from this blog onwards we dive deep into its concepts. This blog covers some of the common methods of Text Analysis that were introduced in the previous lectures.

TERM FREQUENCY INVERSE DOCUMENT FREQUENCY - TFIDF

It is a very basic and simple implementation of text analysis. It is an improvement on the binary vectorizer and the count vectorizer in which the table values are the tfidf score for each word instead of the word count, and is thus called the tfidf vectorizer.

Idea

TFIDF uses the term frequency and the document frequency to measure the importance of a word. It says that as the term frequency increases, the word is more important in characterising the document, and the document is centred around it. On the other hand, as the document frequency increases, comparing two documents becomes difficult, which suggests that the word is too common and will not help i
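The interplay of term frequency and document frequency can be sketched as follows. This is an illustrative version under assumptions: the sample documents and the function name `tfidf` are hypothetical, and the weighting used here (relative term frequency multiplied by log(N / document frequency)) is one common textbook variant. Library vectorizers often apply extra smoothing or normalisation, so their numbers will differ.

```python
import math

# Hypothetical toy corpus, tokenised by splitting on whitespace.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]


def tfidf(docs):
    """Return one {term: score} dictionary per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    scores = []
    for tokens in tokenized:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)  # relative term frequency
            idf = math.log(n_docs / df[term])      # rarer terms score higher
            row[term] = tf * idf
        scores.append(row)
    return scores


scores = tfidf(documents)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In the first document "the" occurs twice but also appears in another document, so it ends up scoring lower than "cat", which occurs once and nowhere else.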